# Sololearn: Python for Data Science

### Measures of Central Tendency

The mean is the average value of a dataset. The median is the middle value of the ordered dataset.<br>
The median is often more representative than the mean when the data contains outliers, because a single value that is much larger or smaller than the rest can shift the mean considerably.
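
As a rough illustration of the point above, the sketch below compares the mean and median of a small made-up list before and after adding one extreme value. The numbers are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical values; 400 is added as a deliberate outlier.
values = [30, 32, 35, 38, 40]
with_outlier = values + [400]

mean_before = np.mean(values)
median_before = np.median(values)
mean_after = np.mean(with_outlier)
median_after = np.median(with_outlier)

# The mean jumps from 35 to about 95.8, the median only from 35 to 36.5.
print(mean_before, median_before)
print(mean_after, median_after)
```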

### Standard Deviation

Standard deviation measures how spread out the data is. It is the square root of the variance.<br>
If the standard deviation is 17.1 and the mean is 33.1, values within one standard deviation of the mean lie between 16.0 (33.1 - 17.1) and 50.2 (33.1 + 17.1).<br>
A low standard deviation means the values tend to be close to the mean, and vice versa.
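
The arithmetic above can be sketched step by step. The data is the same array used in the "Statistics with NumPy" section of these notes, and the variance here is the population variance (dividing by the number of values), which is also what np.var() computes by default.

```python
import numpy as np

data = np.array([14, 18, 19, 24, 26, 33, 42, 55, 67])

mean = data.mean()
variance = ((data - mean) ** 2).mean()   # average squared distance from the mean
std = variance ** 0.5                    # standard deviation = square root of variance

# Bounds of "within one standard deviation of the mean"
lower, upper = mean - std, mean + std
within = data[(data >= lower) & (data <= upper)]

print(round(float(mean), 1), round(float(std), 1))  # 33.1 17.1
print(within)                                        # [18 19 24 26 33 42]
```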

### NumPy (Numerical Python)

To use NumPy, first import the library

In [1]:
import numpy as np

### NumPy Arrays

NumPy arrays are faster and more compact than Python lists. They are homogeneous, meaning all elements share a single data type. A NumPy array can be created as follows

In [2]:
x = np.array([1, 2, 3, 4])
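
A small sketch of what "homogeneous" means in practice: when integers and floats are mixed, NumPy stores everything as one common type. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3, 4])
mixed = np.array([1, 2.5, 3])   # an int/float mix is stored as floats

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(mixed.dtype)   # float64
print(mixed)
```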

Their elements can be accessed using indexes

In [3]:
print(x[0])

1


NumPy arrays are called ndarrays, meaning ‘N-dimensional arrays’, because they can have multiple dimensions

In [4]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x[1][2])

6


This is a 2d array which has 3 rows and 3 columns

### NumPy Array Properties

In [5]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(x.ndim)
print(x.size)
print(x.shape)

2
9
(3, 3)


ndim returns the number of dimensions of the array. size returns the total number of elements.<br>
shape returns a tuple of integers giving the number of elements along each dimension. Here, (3, 3) means this 2d array has 3 rows and 3 columns.<br>
We can add, remove, and sort elements using np.append(), np.delete(), and np.sort()

In [6]:
import numpy as np

x = np.array([2, 1, 3])

x = np.append(x, 4)
x = np.delete(x, 0)
x = np.sort(x)

print(x)

[1 3 4]


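np.delete() removes the element at the given index and np.append() adds the given element to the end. Both return a new array rather than modifying the original, which is why the result is assigned back to x.<br>
np.arange() creates an array containing a range of numbers, similar to Python's range(). np.arange(2, 10, 3) starts at 2, stops before 10, and steps by 3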

In [8]:
import numpy as np

x = np.arange(2, 10, 3)
print(x)

[2 5 8]


### Reshape

In [9]:
import numpy as np

x = np.arange(1, 7)

z = x.reshape(3, 2)

print(z) 

[[1 2]
 [3 4]
 [5 6]]


Here, a 1d array containing 6 elements is reshaped into a 2d array with 3 rows and 2 columns. The new shape must hold the same number of elements as the original array.<br>
reshape() can also do the opposite

In [10]:
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

z = x.reshape(6)

print(z)

[1 2 3 4 5 6]


Here, the 2d array is reshaped into a 1d array. The same flat array can also be obtained with the flatten() method
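
A minimal sketch of two points from this section: flatten() gives the same result as reshape(6) here, and reshaping to a shape with a different element count raises a ValueError. The flag name mismatch_raised is only for illustration.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

flat = x.flatten()    # returns a flattened copy of the array
same = x.reshape(6)

print(flat)                        # [1 2 3 4 5 6]
print(np.array_equal(flat, same))  # True

# A (4, 2) shape needs 8 elements, but x has only 6.
try:
    x.reshape(4, 2)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)             # True
```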

### Indexing and Slicing

NumPy arrays can be indexed and sliced in the same way as Python lists

In [11]:
import numpy as np

x = np.arange(1, 10)

print(x[0:2])
print(x[5:])
print(x[:2])
print(x[-3:])

[1 2]
[6 7 8 9]
[1 2]
[7 8 9]


### Conditions

Here, elements that are less than 4 are selected

In [12]:
import numpy as np

x = np.arange(1, 10)

print(x[x<4]) #[1 2 3]

[1 2 3]


Conditions can be combined with & (and) and | (or). Each condition must be wrapped in parentheses

In [13]:
import numpy as np

x = np.arange(1, 10)

print(x[(x>5) & (x%2==0)])

[6 8]
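
For comparison, a short sketch of the | operator on the same array. It keeps the elements that satisfy at least one of the two conditions.

```python
import numpy as np

x = np.arange(1, 10)

# Keep values below 3 OR above 7
edges = x[(x < 3) | (x > 7)]

print(edges)  # [1 2 8 9]
```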


### Operations

sum() returns the sum of all elements

In [14]:
import numpy as np

x = np.arange(1, 10)
print(x.sum())

45


min() and max() return the smallest and largest elements
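
A brief sketch of both methods on the same array used for sum():

```python
import numpy as np

x = np.arange(1, 10)

smallest = x.min()
largest = x.max()

print(smallest)  # 1
print(largest)   # 9
```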

### Broadcasting

Broadcasting lets NumPy apply an operation between arrays of different shapes. In the simplest case, a single number is combined with every element of an array

In [15]:
import numpy as np

x = np.arange(1, 10)
y = x*2

print(y)

[ 2  4  6  8 10 12 14 16 18]
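
Broadcasting is not limited to a single number. As a sketch with made-up values, a 1d array can be combined with every row of a 2d array, as long as its length matches the number of columns.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
row = np.array([10, 20, 30])

# row is added to each of the two rows of grid
result = grid + row

print(result)
# [[11 22 33]
#  [14 25 36]]
```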


### Statistics with NumPy
NumPy has built-in functions for statistics

In [16]:
import numpy as np

x = np.array([14, 18, 19, 24, 26, 33, 42, 55, 67])

print(np.mean(x))
print(np.median(x))
print(np.var(x))
print(np.std(x))

33.111111111111114
26.0
292.5432098765432
17.10389458212787


### Pandas (Panel Data)

Pandas is used to read and extract data from files, and to transform, analyze, and calculate statistics on it<br>
First, import pandas

In [None]:
import pandas as pd

### Series and DataFrames

A Series is a single column, while a DataFrame is a two-dimensional table made up of a collection of Series<br>
In other words, a Series is a labeled 1d array and a DataFrame is a labeled 2d array
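
To make the distinction concrete, the sketch below builds a Series directly and then turns it into a one-column DataFrame. The names and ages are the sample values used in the DataFrame examples in these notes.

```python
import pandas as pd

ages = pd.Series([14, 18, 24, 42],
                 index=['James', 'Bob', 'Amy', 'Dave'],
                 name='ages')

print(ages.ndim)      # 1, a Series has one dimension

table = ages.to_frame()   # one-column DataFrame named after the Series

print(table.ndim)     # 2, a DataFrame has rows and columns
print(table.shape)    # (4, 1)
```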

### Creating a DataFrame
Let’s use a dictionary to create a dataframe

In [17]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data)
print(df)

   ages  heights
0    14      165
1    18      180
2    24      176
3    42      184


This DataFrame has 2 columns. Pandas automatically creates a numeric index for the rows. We can also specify a custom index.

In [18]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])
print(df)

       ages  heights
James    14      165
Bob      18      180
Amy      24      176
Dave     42      184


We can access a row by its index label using loc[]

In [19]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])
print(df.loc["Bob"])

ages        18
heights    180
Name: Bob, dtype: int64


This shows the data for Bob: his age and height.
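
loc[] also accepts a column label after the row label, which narrows the result to a single value or a subset. A short sketch on the same data:

```python
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

# One row label and one column label give a single value
bob_age = df.loc["Bob", "ages"]

# A list of row labels gives several rows of one column
some_heights = df.loc[["James", "Amy"], "heights"]

print(bob_age)        # 18
print(some_heights)
```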

### Indexing

We can select a single column (a Series) by its name

In [20]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

print(df["ages"])

James    14
Bob      18
Amy      24
Dave     42
Name: ages, dtype: int64


To select multiple columns (a DataFrame), pass a list of column names

In [3]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

print(df[["ages", "heights"]])

       ages  heights
James    14      165
Bob      18      180
Amy      24      176
Dave     42      184


### Slicing

iloc[] selects rows by their integer position, which makes it possible to slice a DataFrame. This is similar to indexing and slicing lists in Python.

In [22]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

# third row
print(df.iloc[2])

# first 3 rows
print(df.iloc[:3])

# rows 2 to 3
print(df.iloc[1:3])

ages        24
heights    176
Name: Amy, dtype: int64
       ages  heights
James    14      165
Bob      18      180
Amy      24      176
     ages  heights
Bob    18      180
Amy    24      176
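
Like lists, iloc[] accepts negative positions, and it can take a column position after the row position. A minimal sketch on the same data:

```python
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

# Row 0, column 1: the height of James
first_height = df.iloc[0, 1]

# Last two rows: Amy and Dave
last_two = df.iloc[-2:]

print(first_height)  # 165
print(last_two)
```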


### Conditions
Rows can be selected based on conditions

In [None]:
import pandas as pd

data = {
   'ages': [14, 18, 24, 42],
   'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

print(df[(df['ages']>18) & (df['heights']>180)])

Similarly, | can be used to select rows that satisfy either condition<br>
In the example below, the rank column is used to look up the matching name

In [None]:
import pandas as pd

data = {
   'name': ['James', 'Billy', 'Bob', 'Amy', 'Tom', 'Harry'],
   'rank': [4, 1, 3, 5, 2, 6]
}

df = pd.DataFrame(data, index=data['name'])
user_rank = int(input())
print(df['name'][df['rank'] == user_rank])
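
Both selections in this section can be sketched without interactive input. The first part uses | on the ages and heights data. The second wraps the rank lookup in a helper, name_for_rank, which is a hypothetical name introduced here and not part of pandas.

```python
import pandas as pd

people = pd.DataFrame(
    {'ages': [14, 18, 24, 42], 'heights': [165, 180, 176, 184]},
    index=['James', 'Bob', 'Amy', 'Dave']
)

# Rows where age is above 18 OR height is above 180
either = people[(people['ages'] > 18) | (people['heights'] > 180)]
print(either)

ranks = pd.DataFrame({
    'name': ['James', 'Billy', 'Bob', 'Amy', 'Tom', 'Harry'],
    'rank': [4, 1, 3, 5, 2, 6]
})

def name_for_rank(frame, rank):
    """Return the names whose rank equals the given value (illustrative helper)."""
    return frame.loc[frame['rank'] == rank, 'name'].tolist()

print(name_for_rank(ranks, 3))  # ['Bob']
```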

### Reading Data
The read_csv() function reads the data of a CSV file into a DataFrame

In [None]:
import pandas as pd

df = pd.read_csv("ca-covid.csv")

Pandas also supports other sources such as JSON files and SQL databases<br>
The head() function of the DataFrame returns the first 5 rows

In [2]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

print(df.head())

       date       state  cases  deaths
0  25.01.20  California      1       0
1  26.01.20  California      1       0
2  27.01.20  California      0       0
3  28.01.20  California      0       0
4  29.01.20  California      0       0


df.head(10) returns the first 10 rows; any number of rows can be requested this way<br>
Similarly, tail() returns the last rows<br>
info() gives information about the dataset, such as the number of rows, the columns, and their data types

In [3]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    342 non-null    object
 1   state   342 non-null    object
 2   cases   342 non-null    int64 
 3   deaths  342 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 10.8+ KB
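
head(n) and tail(n) can be tried without downloading anything by reading CSV text from memory. The rows below are made up for illustration and only mimic the column layout of ca-covid.csv; io.StringIO stands in for a file.

```python
import io
import pandas as pd

# Hypothetical rows, NOT taken from the real dataset
csv_text = """date,state,cases,deaths
01.03.20,California,5,0
02.03.20,California,7,1
03.03.20,California,9,0
04.03.20,California,12,2
"""

sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)   # first 2 rows
last_two = sample.tail(2)    # last 2 rows

print(first_two)
print(last_two)
```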


The DataFrame loaded above has an auto-generated index (the RangeIndex in the info() output). We can set our own index using the set_index() function

In [6]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
df.set_index("date", inplace=True)

print(df.head())

               state  cases  deaths
date                               
25.01.20  California      1       0
26.01.20  California      1       0
27.01.20  California      0       0
28.01.20  California      0       0
29.01.20  California      0       0


The date column is a good index because each date has its own row<br>
inplace=True means the change is applied to the DataFrame itself, without the need to assign the result to a new variable
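
The difference can be sketched on a two-row frame built from the first rows shown above. Without inplace=True, set_index() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'date': ['25.01.20', '26.01.20'], 'cases': [1, 1]})

# Without inplace: a new DataFrame is returned, df keeps its numeric index
indexed = df.set_index('date')
name_before = df.index.name
print(name_before)          # None
print(indexed.index.name)   # date

# With inplace: df itself is modified and the call returns None
returned = df.set_index('date', inplace=True)
print(returned)             # None
print(df.index.name)        # date
```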

### Dropping a Column

In [9]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
df.set_index('date', inplace=True)
df.drop('state', axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 342 entries, 25.01.20 to 31.12.20
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cases   342 non-null    int64
 1   deaths  342 non-null    int64
dtypes: int64(2)
memory usage: 8.0+ KB


Now there are only 2 columns, since one was used as the index and another was dropped.<br>
axis=1 drops a column, while axis=0 drops a row
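
A small sketch of both axis values, using the ages and heights sample data from earlier in these notes so that it runs without the CSV file:

```python
import pandas as pd

df = pd.DataFrame(
    {'ages': [14, 18, 24, 42], 'heights': [165, 180, 176, 184]},
    index=['James', 'Bob', 'Amy', 'Dave']
)

# axis=0: drop the row labeled 'Bob'
without_bob = df.drop('Bob', axis=0)

# axis=1: drop the column named 'heights'
without_heights = df.drop('heights', axis=1)

print(without_bob)
print(without_heights)
```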

### Creating Columns
Here, a month column is created from the date column

In [1]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)

df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()

df.set_index('date', inplace=True)

print(df.head())

          cases  deaths    month
date                            
25.01.20      1       0  January
26.01.20      1       0  January
27.01.20      0       0  January
28.01.20      0       0  January
29.01.20      0       0  January
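
The format string "%d.%m.%y" tells to_datetime() to read each value as day, month, and two-digit year separated by dots. The sketch below shows the intermediate parsed dates on a tiny frame; the first two dates come from the output above, while the third row is made up to show a second month.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['25.01.20', '26.01.20', '01.02.20'],  # last row is hypothetical
    'cases': [1, 1, 0]
})

# Parse the strings into real datetime values first
parsed = pd.to_datetime(df['date'], format="%d.%m.%y")
print(parsed.dt.year.tolist())   # [2020, 2020, 2020]

# Then derive the month name from the parsed dates
df['month'] = parsed.dt.month_name()
print(df)
```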
