# Lecture 3 Visualization and Preprocessing

In [None]:
import numpy as np
import pandas as pd  # for data analytics

# The two packages below are for plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Put the magic command below in the first cell of your notebook so that matplotlib plots are displayed directly below your code.
# Otherwise, call plt.show() after each plot.
%matplotlib inline

## 1. Data Loading and Cleaning Basics

To import data from a CSV file, use the syntax <font color=#09A4D7>pd.read_csv(FILE_NAME)</font>. Make sure the file is in your current working directory. You will obtain a Pandas DataFrame (similar to a spreadsheet).

- Use <font color=#09A4D7>DataFrame.head(k)</font> to see the first <font color=#09A4D7>k</font> rows of the data. 
- Use <font color=#09A4D7>DataFrame.shape</font> to get the dimensions (number of rows, number of columns). 
- Use <font color=#09A4D7>DataFrame.columns</font> to get the column names.
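
As a minimal sketch of the calls above, the cell below reads a tiny CSV from an in-memory string instead of a file on disk. The column names and values are made up for illustration and are not taken from train.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here instead of a file on disk
csv_text = """Id,LotArea,SalePrice
1,8450,208500
2,9600,181500
3,11250,223500
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first 2 rows
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names as a plain list
```

With a real file, replace the `io.StringIO(...)` argument with the file name.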

### <font color=#BA2121>Practice</font>
<ol>
    <li>Use Pandas package to load the csv file: <a href="train.csv">train.csv</a></li>
    <li>View the first <font color=#09A4D7>10</font> rows of the dataset.</li>
    <li>Report the number of records and number of variables.</li>
    <li>Report all the variable names.</li>
</ol>

In [None]:
# data loading


In [None]:
# first 10 rows

In [None]:
# shape of the dataframe


In [None]:
# variable names


### 1.1 Handling Anomalies
- Use <font color=#09A4D7>DataFrame.drop_duplicates()</font> to keep only unique records.
- Use <font color=#09A4D7>DataFrame.dropna()</font> to drop missing values.
    - By default, it drops every row that contains at least one missing value.
    - <font color=#09A4D7>axis</font>: default = 'index' (i.e., drop rows). If you want to drop columns that contain any missing value, specify <font color=#09A4D7>axis = 'columns'</font>.
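
The cell below is a small sketch of both methods on a hypothetical DataFrame (the column names a, b, c are invented for illustration). Note that neither method changes the original DataFrame; each returns a new one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row duplicates the first, and column b has one missing value
df = pd.DataFrame({
    "a": [1, 1, 2, 3],
    "b": [10.0, 10.0, np.nan, 30.0],
    "c": ["x", "x", "y", "z"],
})

unique_rows = df.drop_duplicates()                     # keep only unique records
no_missing_rows = unique_rows.dropna()                 # drop rows with any missing value
no_missing_cols = unique_rows.dropna(axis="columns")   # drop columns with any missing value

print(unique_rows.shape, no_missing_rows.shape, no_missing_cols.shape)
```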

In [None]:
# drop duplicates


In [None]:
# Drop variables with missing values


### 1.2 Pandas DataFrame

In this section, we focus on indexing and subsampling of a pandas dataframe. This helps with calculation and advanced visualization in upcoming sections.

One advantage of a Pandas DataFrame is that columns can be selected easily with the "**[ ]**" operator. 
- To obtain one variable (returned as a pandas 'Series'), use: <font color=#09A4D7>DataFrame[var_name]</font>
- To obtain multiple variables (returned as a DataFrame), use: <font color=#09A4D7>DataFrame[var_list]</font>, where var_list = [var1, var2, ...]
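
A short sketch of the two forms, using a hypothetical DataFrame whose column names are invented for illustration:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "price": [100, 200, 300],
    "area": [50, 80, 120],
    "rooms": [2, 3, 4],
})

one_var = df["price"]              # a single name gives a Series
var_list = ["price", "area"]
several_vars = df[var_list]        # a list of names gives a DataFrame

print(type(one_var).__name__, type(several_vars).__name__)
```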

In [None]:
# Selecting one/multiple columns through [] operator


There are two choices for indexing in a DataFrame. 
- **Integer-based indexing.** Works like indexing a numpy array. Syntax: **DataFrame.iloc[row_range, column_range]**
    - You can use a list of integers to select non-adjacent rows or columns.
- **Label-based indexing.** Extracts rows and columns based on their labels (e.g., variable names). Syntax: **DataFrame.loc[row_range, column_range]**. 
    - For label indexing, the range is INCLUSIVE at both ends, unlike integer indexing, where the end of the range is excluded!
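
The difference between the two is easiest to see side by side. The sketch below uses a hypothetical five-row DataFrame with the default integer row labels, so the same slice `0:3` means three rows for `iloc` but four rows for `loc`.

```python
import pandas as pd

# Hypothetical data with default row labels 0..4
df = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "area": [50, 80, 120, 150, 200],
    "rooms": [2, 3, 4, 4, 5],
})

by_position = df.iloc[0:3, 0:2]            # rows 0, 1, 2 and the first two columns (end excluded)
by_label = df.loc[0:3, ["price", "area"]]  # rows labelled 0 through 3 (end included)
picked = df.iloc[[0, 4], [0, 2]]           # lists select separated rows and columns

print(by_position.shape, by_label.shape, picked.shape)
```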

In [None]:
# Select rows and columns based on integer indexing. 
# This works the same as 2d arrays.


In [None]:
# Selecting several columns/rows based on label indexing 
# By default, row labels are just integers


## 2. Univariate Data Analysis
To obtain summary statistics, use the syntax **DataFrame.describe()**: 
- <font color=#09A4D7>percentiles</font>: the percentiles to report in place of the default quartiles (25%, 50%, 75%). Use a list of values between 0 and 1 to specify all percentiles.
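
As a small sketch on hypothetical data, the call below asks for the 10th and 90th percentiles. Percentiles are written as fractions between 0 and 1, and pandas reports them with row labels such as "10%".

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"price": [100, 200, 300, 400, 500]})

# Request the 10th, 50th, and 90th percentiles
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```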

In [None]:
# Obtain summary statistics


### 2.1 Visualization Basics
**Labels and Titles**

- Use **plt.xlabel("CONTENT")**, and **plt.ylabel("CONTENT")** to specify the labels. Use **plt.title("TITLE NAME")** to specify the titles.

- Use **plt.xlim(LEFT_LIMIT, RIGHT_LIMIT)** to set range for x axis. Change xlim to ylim for y axis.

- Use **plt.xticks(ticks, labels)** to specify ticks (i.e., what values to include) and **corresponding** labels (i.e., labels of the values). Both should be specified in a list. The length of ticks and labels must match.

- Use **plt.legend([list_of_legends])** to show a legend.

**Parameter Settings** The specific parameters differ between plot types, but here are some common ones:
- <font color=#09A4D7>color</font>: color of the lines/bars; 'r', 'g', 'b', 'k' for red, green, blue, and black, respectively.
- <font color=#09A4D7>alpha</font>: transparency level, 0<=alpha<=1; a lower value of alpha => more transparent.
- <font color=#09A4D7>linestyle</font>: the line style (e.g., solid/dashed/...). Solid: '-', Dashed: '--', Dotted: ':', Dash-dot: '-.'
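
The sketch below puts the labelling functions and common parameters together on a simple line plot. The data (sine and cosine curves) and the tick labels are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary example data
x = np.linspace(0, 10, 50)

plt.plot(x, np.sin(x), color="r", linestyle="--", alpha=0.8)  # red, dashed, slightly transparent
plt.plot(x, np.cos(x), color="b", linestyle=":")              # blue, dotted

plt.xlabel("x")
plt.ylabel("value")
plt.title("Sine and cosine")
plt.xlim(0, 10)
plt.xticks([0, 5, 10], ["start", "middle", "end"])  # three ticks, three matching labels
plt.legend(["sin(x)", "cos(x)"])                    # one entry per line, in plotting order
```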

### 2.2 Histogram
We can use the syntax **plt.hist(x)** to create the histogram of x. 
- <font color=#09A4D7>bins</font>: number of bins

To create overlapping histograms, we use another package: Seaborn. The syntax is: **sns.histplot(data, x, hue)**. Pass each argument by its parameter name (e.g., data=..., x=..., hue=...).
- <font color=#09A4D7>data</font>: the Pandas DataFrame to use
- <font color=#09A4D7>x</font>: the variable name of x (the one for the histogram)
- <font color=#09A4D7>hue</font>: the variable name of the category, if you want to create overlapping histograms split by a specific variable.
    - It must be a variable in the DataFrame.

For other specifications, check: https://seaborn.pydata.org/generated/seaborn.histplot.html

### <font color=#BA2121>Practice: Histogram</font>
- Draw a histogram of <font color=#09A4D7>SalePrice</font> with <font color=#09A4D7>20</font> bins. Set the color to <font color=#09A4D7>red</font>. Label the axes accordingly.
- Draw histograms of the following two samples, and overlay them in one figure.
  <ol>
    <li>Those with <font color=#09A4D7>OverallQual greater than 6</font>;</li>
    <li>Those with <font color=#09A4D7>OverallQual less than or equal to 6</font>.</li> 
  </ol>

In [None]:
## Histogram with detailed labels


In [None]:
## Overlaid Histogram (w. seaborn)
# First, generate a new variable, it takes 1 if quality is high, 0 otherwise


### 2.3 Boxplot
Use the syntax **plt.boxplot(x)** to create a boxplot of x. Drawing boxplots for different groups with pyplot is tedious. In that case, use: **sns.boxplot(data, x, y...)**
- <font color=#09A4D7>x</font>: If one-dimensional (a single boxplot), the variable to plot. If two-dimensional (i.e., boxplots of different groups), the group indicator.
- <font color=#09A4D7>y</font>: Not needed if one-dimensional. If two-dimensional, the variable to plot.
- <font color=#09A4D7>orient</font>: "v" for vertical, "h" for horizontal

For other specifications: https://seaborn.pydata.org/generated/seaborn.boxplot.html
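
The sketch below shows a single boxplot with pyplot and a grouped boxplot with seaborn. The data and the column names group and value are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: two groups, the second with one large value
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [1, 2, 3, 4, 5, 3, 4, 5, 6, 20],
})

# Single boxplot of one variable with pyplot
plt.boxplot(df["value"])
plt.ylabel("value")

# One box per group with seaborn: x is the group indicator, y is the variable to plot
plt.figure()  # start a new figure
ax = sns.boxplot(data=df, x="group", y="value")
```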

### <font color=#BA2121>Practice:</font>
<ol>
    <li>Provide a boxplot of SalePrice</li>
    <li>Provide a set of boxplots of 'SalePrice' based on different levels of 'OverallQual' (use Seaborn)</li>
</ol>

In [None]:
# Boxplot basics (sns)


In [None]:
# To provide a set of boxplots, it's a lot easier to use seaborn


## 3. Multivariate Data Analysis

### 3.1 Correlation
To create a correlation matrix, use the syntax **DataFrame[var_list].corr()**, where the variables of interest are specified in a list. Pass the resulting matrix to **sns.heatmap()** to draw a heatmap.
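
As a small sketch on hypothetical data (the columns x, y, z, w are invented for illustration), the cell below computes a correlation matrix for a list of variables and draws it as a heatmap. The `annot`, `vmin`, `vmax`, and `cmap` arguments are optional styling choices, not requirements.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: y rises with x, z falls with x, w is loosely related
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "w": [1, 3, 2, 5, 4],
})

var_list = ["x", "y", "z", "w"]
corr = df[var_list].corr()  # pairwise Pearson correlations
print(corr)

# annot=True writes each coefficient in its cell; vmin/vmax fix the color scale to [-1, 1]
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```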

### <font color=#BA2121>Practice:</font>
<ol>
    <li>Create a correlation matrix of the following variables: 'SalePrice','OverallQual','GrLivArea','GarageCars','TotalBsmtSF'</li>
    <li>Use a heatmap to visualize the correlation of the above-mentioned variables</li>
</ol>

In [None]:
# Correlation matrix

In [None]:
# heatmap

In [None]:
# heatmap


### 3.2 Scatter
To produce a scatter plot, use the syntax **plt.scatter(x, y)**. The first argument is plotted on the horizontal axis, the second on the vertical axis.
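
A minimal sketch with hypothetical data follows. Coloring the points through the `c` argument and adding a colorbar is one possible way to show a third variable; it is not the only one (seaborn offers alternatives).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data for illustration
df = pd.DataFrame({
    "area": [50, 80, 120, 150, 200],
    "price": [100, 180, 260, 330, 480],
    "quality": [3, 5, 6, 7, 9],
})

# x is the horizontal axis, y the vertical axis; c colors each point by a third variable
sc = plt.scatter(df["area"], df["price"], c=df["quality"], alpha=0.7)
plt.colorbar(sc, label="quality")  # shows which color corresponds to which value
plt.xlabel("area")
plt.ylabel("price")
```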

### <font color=#BA2121>Practice</font>
<ol>
    <li>Use scatter plot to explore the relationship between SalePrice and GrLivArea</li>
    <li>Create a scatter plot to show the relationship between SalePrice and GrLivArea, using different colors to show OverallQual. Set the transparency to 0.7.</li>
</ol>

In [None]:
# Scatter plot basics


In [None]:
# Scatter plot with color


## 4. Data Preprocessing I 
In most cases, we rely on the <font color=#09A4D7>sklearn.preprocessing</font> module to complete data preprocessing. Examples include scaling and encoding. The sklearn interface is consistent across different scenarios (e.g., preprocessing, model training, etc.), so the workflow used here carries over to later sections.

### 4.1 Skewed Data
Consider a log transform or a square-root transform. Both can be computed directly with numpy (i.e., np.log(), np.sqrt()).
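
The sketch below applies both transforms to a randomly generated right-skewed sample (hypothetical data, not the housing dataset) and compares skewness with pandas' `Series.skew()`. Keep in mind that np.log() requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

log_x = np.log(x)    # log transform
sqrt_x = np.sqrt(x)  # square-root transform

# Compare sample skewness before and after each transform
print("raw: ", x.skew())
print("log: ", log_x.skew())
print("sqrt:", sqrt_x.skew())
```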

### 4.2 Data Scaling and Encoding
The data scaling and encoding tools are in the preprocessing module of the Scikit-Learn package. Code that uses sklearn is longer and follows a standard workflow, which we will elaborate on in the linear regression section. In the current section, we only provide the syntax and a coding template.

For specific parameter settings, please check the sklearn manual. You can simply copy and paste the syntax into Google to find the official manual.

**Standard Scaler:** use syntax: <font color=#09A4D7>sklearn.preprocessing.StandardScaler()</font>

**Min-Max Scaler:** use syntax: <font color=#09A4D7>sklearn.preprocessing.MinMaxScaler()</font>

**One-Hot Encoding:** use syntax: <font color=#09A4D7>sklearn.preprocessing.OneHotEncoder()</font>

**Ordinal Encoding:** use syntax: <font color=#09A4D7>sklearn.preprocessing.OrdinalEncoder()</font>
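
As a rough coding template, the sketch below applies all four tools with `fit_transform` on a tiny hypothetical DataFrame (the columns area and size are invented for illustration). Passing an explicit category order to OrdinalEncoder is an assumption made here so that low < medium < high; without it, categories are ordered alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
)

# Hypothetical data: one numeric and one categorical variable
df = pd.DataFrame({
    "area": [1.0, 2.0, 3.0],
    "size": ["low", "high", "medium"],
})

# Scalers expect 2D input, so select columns with a list: df[["area"]]
standardized = StandardScaler().fit_transform(df[["area"]])  # mean 0, standard deviation 1
rescaled = MinMaxScaler().fit_transform(df[["area"]])        # rescaled to the range [0, 1]

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
one_hot = OneHotEncoder().fit_transform(df[["size"]]).toarray()

# Explicit order (an assumption for this example): low -> 0, medium -> 1, high -> 2
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit_transform(df[["size"]])

print(standardized.ravel(), rescaled.ravel())
print(one_hot)
print(ordinal.ravel())
```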