# Python Data Analytics Workshop
--- 
In this notebook, we look at data analytics in Python using packages including Pandas, scikit-learn, Matplotlib, and statsmodels.

# Power of Google & Online Resources
There are a LOT of online resources available to help us get started in Python, and Google search has come a long way in meeting that need. When you search for a simple Python-related question, Google can often surface answers from Q&A sites such as Stack Overflow.

Package documentation is also a very useful tool. Most commonly used packages come with comprehensive documentation for every function, often with a link to the source code. This is our source of truth. Learning how to read documentation will help you clarify any additional questions.

# Introduction to Pandas 
Pandas is the essential tool for reading table-like or time-series data, such as a CSV or Excel file.

In [23]:
import pandas as pd
import numpy as np

In [19]:
#Load dataset
df_iris = pd.read_csv('iris.csv')

In [5]:
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


## DataFrames & Series
A DataFrame is Pandas' two-dimensional data structure for storing tabular data. A Pandas Series is a one-dimensional array with labels.

In [20]:
type(df_iris)

pandas.core.frame.DataFrame

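The relationship between the two structures can be sketched on a small table. The frame `df_toy` and its values are made up for illustration; they only mimic the shape of the iris data.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rows of the iris table.
df_toy = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 4.7],
    "variety": ["Setosa", "Setosa", "Setosa"],
})

col = df_toy["sepal.length"]    # a single label returns a Series
sub = df_toy[["sepal.length"]]  # a list of labels returns a DataFrame

print(type(col).__name__, col.shape)  # Series (3,)
print(type(sub).__name__, sub.shape)  # DataFrame (3, 1)
```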
### Accessing a dataframe

In [11]:
# by labels:
df_iris['sepal.length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
149    5.9
150    5.9
151    6.0
152    6.7
153    NaN
Name: sepal.length, Length: 154, dtype: float64

In [22]:
# by attributes
df_iris.variety

0          Setosa
1          Setosa
2          Setosa
3          Setosa
4          Setosa
          ...    
149     Virginica
150     Virginica
151    Versicolor
152     Virginica
153        Setosa
Name: variety, Length: 154, dtype: object

In [13]:
# access a row by index:
df_iris.iloc[0]

sepal.length       5.1
sepal.width        3.5
petal.length       1.4
petal.width        0.2
variety         Setosa
Name: 0, dtype: object

In [18]:
# access multiple rows by index (slicing ranges):
df_iris.iloc[0:5]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [17]:
# access columns by index:
df_iris.iloc[:,3]

0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
      ... 
149    1.8
150    1.8
151    1.6
152    2.3
153    0.4
Name: petal.width, Length: 154, dtype: float64

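The cells above use `iloc`, which selects by position. Its label-based counterpart is `loc`. The sketch below contrasts the two on a made-up frame (`df_toy` is hypothetical) whose index deliberately does not start at 0, so position and label differ.

```python
import pandas as pd

# Hypothetical toy frame; the index labels 10, 20, 30 are chosen on purpose.
df_toy = pd.DataFrame(
    {"petal.width": [0.2, 1.8, 2.3],
     "variety": ["Setosa", "Virginica", "Virginica"]},
    index=[10, 20, 30],
)

by_position = df_toy.iloc[0]          # first row, whatever its label is
by_label = df_toy.loc[10]             # the row whose index label is 10
cell = df_toy.loc[20, "petal.width"]  # one cell, by row label and column label

label_slice = df_toy.loc[10:20]  # label slices include the end label
pos_slice = df_toy.iloc[0:1]     # position slices exclude the end position

print(len(label_slice), len(pos_slice))  # 2 1
```

With the default 0, 1, 2, ... index used by `df_iris`, single-row lookups through `loc` and `iloc` return the same row, which is why the difference is easy to overlook.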
### Adding rows or columns

In [27]:
# add a new column:
new_col = np.zeros(df_iris.shape[0])

In [28]:
new_col

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.])

In [29]:
df_iris['new_col'] = new_col

In [30]:
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,new_col
0,5.1,3.5,1.4,0.2,Setosa,0.0
1,4.9,3.0,1.4,0.2,Setosa,0.0
2,4.7,3.2,1.3,0.2,Setosa,0.0
3,4.6,3.1,1.5,0.2,Setosa,0.0
4,5.0,3.6,1.4,0.2,Setosa,0.0


We typically don't add rows directly to a dataframe. The more common alternative is to concatenate two or more dataframes with identical columns using the concat() function.

In [32]:
# concatenate a few dataframe into a new dataframe:
df_iris2 = df_iris.copy()
df_iris3 = df_iris.copy()
df_iris_concat = pd.concat([df_iris, df_iris2, df_iris3])
df_iris_concat.shape

(462, 6)

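One detail worth knowing: by default `concat()` keeps each input's index labels, so the result can contain repeated labels. A minimal sketch with two made-up frames (`df_a` and `df_b` are hypothetical) shows the effect and the `ignore_index=True` option that renumbers the rows.

```python
import pandas as pd

# Hypothetical toy frames with identical columns.
df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([df_a, df_b])                        # original labels kept
renumbered = pd.concat([df_a, df_b], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```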
### Deleting rows or columns
Deleting something from a dataframe is referred to as "dropping" it.

In [33]:
# Dropping a column:
## drop() returns a new DataFrame; the original DF is not changed!
# equivalent form: df_iris.drop(['new_col'], axis=1)
df_iris.drop(columns=['new_col'])

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
149,5.9,3.0,5.1,1.8,Virginica
150,5.9,,5.1,1.8,Virginica
151,6.0,3.4,,1.6,Versicolor
152,6.7,3.0,,2.3,Virginica


In [35]:
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,new_col
0,5.1,3.5,1.4,0.2,Setosa,0.0
1,4.9,3.0,1.4,0.2,Setosa,0.0
2,4.7,3.2,1.3,0.2,Setosa,0.0
3,4.6,3.1,1.5,0.2,Setosa,0.0
4,5.0,3.6,1.4,0.2,Setosa,0.0


In [36]:
# either assign the result of drop() back to the original df
df_iris = df_iris.drop(columns=['new_col'])
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [38]:
# or use the inplace=True argument of drop() (the column is re-added first for the demo)
df_iris['new_col'] = new_col
df_iris.drop(columns=['new_col'], inplace=True)
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


# A typical data analysis workflow

## Exploratory

### Overview of the data

In [3]:
df_iris.describe()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,153.0,153.0,152.0,154.0
mean,5.850327,3.061438,3.751974,1.207792
std,0.822874,0.433335,1.766539,0.762395
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.575,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.4,5.1,1.8
max,7.9,4.4,6.9,2.5


### Missing data:
Missing data in Pandas is represented by NaN, a special floating-point value available in NumPy as np.nan.

In [6]:
df_iris.isna().sum()

sepal.length    1
sepal.width     1
petal.length    2
petal.width     0
variety         0
dtype: int64

In [52]:
df_iris[df_iris.isna().any(axis=1)]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
150,5.9,,5.1,1.8,Virginica
151,6.0,3.4,,1.6,Versicolor
152,6.7,3.0,,2.3,Virginica
153,,3.4,1.5,0.4,Setosa

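Because NaN never compares equal to anything, including itself, an equality test cannot find missing values, which is why the cells above use `isna()`. A minimal sketch on a made-up Series (`s` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing value.
s = pd.Series([5.9, np.nan, 6.7])

print(np.nan == np.nan)     # False: NaN is not equal to itself
print((s == np.nan).sum())  # 0: an equality test finds nothing
print(s.isna().sum())       # 1: isna() is the reliable check

# Most reductions skip NaN by default (skipna=True),
# so the mean is taken over the two present values only.
print(s.mean())
```

This default skipping also explains why the `count` row of `describe()` differs between columns: it counts only the non-missing entries.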

## Preprocessing
Preprocessing a dataset gets it ready for the next steps. Usually, for a numerical dataset, we want to clean it as much as our needs require, so that we don't have to change the data again during statistical testing and/or model building.

Some of the most common preprocessing steps are: 
1. Mismatched data types within a column -> use casting to convert the data to the correct type
2. Outliers -> drop using Z-score
3. Missing data -> fill or drop?

For machine learning, several more steps are needed:

4. Class balance -> make sure each class has a comparable amount of data
5. Data normalization -> scale the features so that they are on the same scale

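Steps 1, 2, 4 and 5 above can be sketched on a small made-up table (missing data, step 3, is treated in the next section). Everything here is an assumption for illustration: the frame `df_toy`, its values, the Z-score cutoff of 3 (a common convention, not something this notebook prescribes), and standardization as the choice of scaling.

```python
import pandas as pd

# Hypothetical toy data: widths stored as text, plus one implausible reading.
df_toy = pd.DataFrame({
    "width": ["3.0", "3.2", "3.4", "3.1"] * 5 + ["30.0"],
    "variety": ["Setosa", "Virginica"] * 10 + ["Setosa"],
})

# 1. Mismatched types: cast text to numbers; unparseable entries become NaN.
df_toy["width"] = pd.to_numeric(df_toy["width"], errors="coerce")

# 2. Outliers: keep rows whose Z-score magnitude is below the chosen cutoff.
z = (df_toy["width"] - df_toy["width"].mean()) / df_toy["width"].std()
df_clean = df_toy[z.abs() < 3].copy()

# 4. Class balance: count the rows per class before modelling.
class_counts = df_clean["variety"].value_counts()

# 5. Normalization: standardize to zero mean and unit standard deviation.
width = df_clean["width"]
df_clean["width_scaled"] = (width - width.mean()) / width.std()

print(len(df_toy), len(df_clean))  # 21 20
print(class_counts.to_dict())
```

Note that a Z-score filter needs enough rows to work: with very few observations, a single extreme value inflates the standard deviation so much that its own Z-score stays below the cutoff.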
## Most common: Dealing with missing values (fill or drop)

In [54]:
df_iris.fillna(0).tail()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
149,5.9,3.0,5.1,1.8,Virginica
150,5.9,0.0,5.1,1.8,Virginica
151,6.0,3.4,0.0,1.6,Versicolor
152,6.7,3.0,0.0,2.3,Virginica
153,0.0,3.4,1.5,0.4,Setosa


In [55]:
df_iris.dropna().tail()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica

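Filling with 0 is rarely a sensible value for a measurement, and dropping every incomplete row can discard useful data. Two common middle grounds are sketched below on a made-up frame (`df_toy` and its values are hypothetical): filling each numeric column with its own median, and dropping rows only when a specific column is missing. Which one is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two numeric columns.
df_toy = pd.DataFrame({
    "sepal.length": [5.9, 6.0, np.nan],
    "petal.length": [5.1, np.nan, 1.5],
    "variety": ["Virginica", "Versicolor", "Setosa"],
})

# Fill each numeric column with that column's median.
numeric_cols = df_toy.select_dtypes(include="number").columns
filled = df_toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Drop a row only if 'sepal.length' is missing; other gaps are kept.
partial = df_toy.dropna(subset=["sepal.length"])

print(filled.isna().sum().sum())  # 0
print(len(partial))               # 2
```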

## Statistical Testing

## Modeling