# Python Data Analysis

The **target** value or **label** is the value that we would like to predict using the other variables.

In [41]:
import pandas as pd
import numpy as np
df = pd.read_csv("assets/sales.csv")
df.head()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.0,1582243.5,951410.5
1,Central America and the Caribbean,Grenada,Cereal,Online,C,8/22/2012,963881480,9/15/2012,2804,205.7,117.11,576782.8,328376.44,248406.36
2,Europe,Russia,Office Supplies,Offline,L,5/2/2014,341417157,5/8/2014,1779,651.21,524.96,1158502.59,933903.84,224598.75
3,Sub-Saharan Africa,Sao Tome and Principe,Fruits,Online,C,6/20/2014,514321792,7/5/2014,8102,9.33,6.92,75591.66,56065.84,19525.82
4,Sub-Saharan Africa,Rwanda,Office Supplies,Offline,L,2/1/2013,115456712,2/6/2013,5062,651.21,524.96,3296425.02,2657347.52,639077.5


## Pre-Processing

After the data has been loaded, we apply several techniques to prepare the raw data for analysis: handling missing values, formatting the data, and normalizing (scaling) it.

#### Dealing With Missing Values

Missing values are common. First, check with the collection source to see whether the missing data can be obtained or estimated. If not, there are several options:

* Replace the missing value:
    * With the average (of similar data points)
    * With the mode, if it is a categorical variable
    * Using another estimation technique
* Or, you may choose to remove the missing data:
    * Either by dropping the data entry (row): ```axis = 0```
    * Or by dropping the entire variable (column): ```axis = 1```
* Lastly, you may choose to leave the missing data as is
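
The replacement options above can be sketched on a small, made-up dataframe. The `toy` dataframe and its values are hypothetical and are not taken from `sales.csv`; the sketch only illustrates counting missing values, filling a categorical column with its mode, and filling a numerical column with its mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from sales.csv), with one missing value per column
toy = pd.DataFrame({
    "Sales Channel": ["Online", "Offline", np.nan, "Online"],
    "Units Sold": [100.0, np.nan, 300.0, 200.0],
})

# Count missing values per column before choosing a strategy
missing_counts = toy.isna().sum()

# Categorical variable: replace missing entries with the most frequent value (mode)
channel_mode = toy["Sales Channel"].mode()[0]
toy["Sales Channel"] = toy["Sales Channel"].fillna(channel_mode)

# Numerical variable: replace missing entries with the column mean
toy["Units Sold"] = toy["Units Sold"].fillna(toy["Units Sold"].mean())
```

Checking `missing_counts` first helps decide whether filling or dropping is more reasonable for a given column.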

In [42]:
df.dropna(subset = ["Sales Channel"], axis = 0) # Returns a new modified dataframe; good for testing
df.dropna(subset = ["Sales Channel"], axis = 0, inplace = True) # Modifies dataframe in-place

In [43]:
avg = df["Units Sold"].mean() # Mean of the non-missing values (NaN is skipped by default)
df["Units Sold"] = df["Units Sold"].fillna(avg) # Replace NaN with mean value

In [44]:
df.head()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.0,1582243.5,951410.5
1,Central America and the Caribbean,Grenada,Cereal,Online,C,8/22/2012,963881480,9/15/2012,2804,205.7,117.11,576782.8,328376.44,248406.36
2,Europe,Russia,Office Supplies,Offline,L,5/2/2014,341417157,5/8/2014,1779,651.21,524.96,1158502.59,933903.84,224598.75
3,Sub-Saharan Africa,Sao Tome and Principe,Fruits,Online,C,6/20/2014,514321792,7/5/2014,8102,9.33,6.92,75591.66,56065.84,19525.82
4,Sub-Saharan Africa,Rwanda,Office Supplies,Offline,L,2/1/2013,115456712,2/6/2013,5062,651.21,524.96,3296425.02,2657347.52,639077.5


#### Formatting

Formatting is the process by which data is transformed to provide a common standard of expression. This facilitates aggregation and comparison. Often this involves performing calculations on an entire column of data to convert it into the desired units or using ```astype()``` to convert data into the correct type.

In [45]:
df["Unit Cost"] = df["Unit Cost"] * 0.89 # Convert USD to EUR (0.89 is a fixed example exchange rate)
df.rename(columns={"Unit Cost":"Unit Cost (EUR)"}, inplace=True)

In [46]:
df["Unit Price"] = df["Unit Price"].astype("int") # Cast column to integer type (the decimal part is truncated)

In [47]:
df.dtypes

Region              object
Country             object
Item Type           object
Sales Channel       object
Order Priority      object
Order Date          object
Order ID             int64
Ship Date           object
Units Sold           int64
Unit Price           int64
Unit Cost (EUR)    float64
Total Revenue      float64
Total Cost         float64
Total Profit       float64
dtype: object
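
The `dtypes` output above shows `Order Date` and `Ship Date` stored as `object` (plain strings). As a minimal sketch of type formatting, the two sample rows below are copied from the `df.head()` output and parsed with `pd.to_datetime`. The month/day/year format string is an assumption inferred from values such as `5/28/2010`, and the `Days to Ship` column is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Two sample rows copied from the df.head() output shown earlier
dates = pd.DataFrame({
    "Order Date": ["5/28/2010", "8/22/2012"],
    "Ship Date": ["6/27/2010", "9/15/2012"],
})

# Parse the strings into datetime64 values (assumed format: month/day/year)
for col in ["Order Date", "Ship Date"]:
    dates[col] = pd.to_datetime(dates[col], format="%m/%d/%Y")

# With proper datetime types, date arithmetic becomes possible
dates["Days to Ship"] = (dates["Ship Date"] - dates["Order Date"]).dt.days
```

Once the columns share a proper date type, comparisons and aggregations (for example, by year or month) work without string manipulation.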

#### Normalization

Normalization rescales the values of different variables to a common range. This enables fair comparison between variables and keeps a variable with large values (such as income) from dominating one with small values (such as age) in the model and our results.

There are three main techniques for normalizing data:

* Simple Feature Scaling (range: 0 to 1, assuming non-negative values)

$$
x_{new} = \frac{x_{old}}{x_{max}}
$$

* Min-Max (range: 0 to 1)

$$
x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}
$$

* Z-score aka Standard Score (typical range: -3 to 3), where $\mu$ is the mean and $\sigma$ is the standard deviation of the variable

$$
x_{new} = \frac{x_{old} - \mu}{\sigma}
$$

In [65]:
a = {"age": [20, 30, 40], "income": [100000, 20000, 50000]} # Not normalized
df = pd.DataFrame(a)
df

Unnamed: 0,age,income
0,20,100000
1,30,20000
2,40,50000


In [66]:
# Simple Feature Scaling
df["age_sfs"] = df["age"] / df["age"].max()
df["income_sfs"] = df["income"] / df["income"].max()
df

Unnamed: 0,age,income,age_sfs,income_sfs
0,20,100000,0.5,1.0
1,30,20000,0.75,0.2
2,40,50000,1.0,0.5


In [67]:
# Min-Max
df["age_mm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
df["income_mm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df

Unnamed: 0,age,income,age_sfs,income_sfs,age_mm,income_mm
0,20,100000,0.5,1.0,0.0,1.0
1,30,20000,0.75,0.2,0.5,0.0
2,40,50000,1.0,0.5,1.0,0.375


In [71]:
# Z-Score (pandas .std() returns the sample standard deviation, ddof=1, by default)
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
df

Unnamed: 0,age,income,age_sfs,income_sfs,age_mm,income_mm,age_z,income_z
0,20,100000,0.5,1.0,0.0,1.0,-1.0,1.072222
1,30,20000,0.75,0.2,0.5,0.0,0.0,-0.907265
2,40,50000,1.0,0.5,1.0,0.375,1.0,-0.164957
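
The z-scores in the table above depend on which standard deviation is used. The sketch below recomputes the `income` z-scores with the pandas default (sample standard deviation, `ddof=1`) and with the population standard deviation (`ddof=0`). The names `z_sample` and `z_population` are illustrative only.

```python
import pandas as pd

# Same income values as in the example dataframe
income = pd.Series([100000, 20000, 50000])

# pandas default: sample standard deviation (divides by n - 1)
z_sample = (income - income.mean()) / income.std()

# Population standard deviation (divides by n)
z_population = (income - income.mean()) / income.std(ddof=0)

print(pd.DataFrame({"z_sample": z_sample, "z_population": z_population}))
```

On small samples the two versions differ noticeably, so it is worth stating which one a result was computed with. The difference shrinks as the number of rows grows.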


#### Binning

Binning involves grouping values, often transforming them from numerical into categorical variables.
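
As a minimal sketch, the ages from the normalization example can be grouped with `pd.cut`. The bin edges and the labels `young`, `middle`, and `senior` are arbitrary choices made for illustration, not values from the source.

```python
import pandas as pd

# Ages reused from the normalization example
ages = pd.DataFrame({"age": [20, 30, 40]})

# Hypothetical bin edges and labels; each bin includes its right edge by default
bins = [0, 25, 35, 120]
labels = ["young", "middle", "senior"]

# pd.cut assigns each numerical value to a labeled interval (a categorical column)
ages["age_group"] = pd.cut(ages["age"], bins=bins, labels=labels)
```

The resulting `age_group` column has a categorical dtype, which makes it convenient for grouping or counting values per bin.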