<a href="https://colab.research.google.com/github/ccwilliamsut/machine_learning/blob/master/MLAB_01_Working_with_the_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This beginner's project is based on many ideas found in *Machine Learning for Absolute Beginners: A Plain English Introduction* by Oliver Theobald. It has been heavily augmented with references found around the web (linked when appropriate).

---

![Machine Learning for Absolute Beginners](https://images-na.ssl-images-amazon.com/images/I/413%2BI3pEaXL.jpg)

---


This book is an excellent introduction for those just beginning their machine learning journey. Though this class introduces a number of principles found in that text, I highly recommend buying the book yourself and working through it after your experience here. He walks the reader through a number of important concepts that are too extensive for this course, and his writing is clear and does a spectacular job of explaining difficult topics. Additionally, he provides a number of examples that the reader can tackle after getting the basics down.

The book can be found here if you are interested: [Machine Learning for Absolute Beginners](https://www.amazon.com/Machine-Learning-Absolute-Beginners-Introduction/dp/1549617214/ref=sr_1_1?crid=1AF1PFSE85G4F&keywords=machine+learning+for+absolute+beginners+a+plain+english+introduction&qid=1563399014&s=gateway&sprefix=machine+learning+for+absolute+beginners+%2Caps%2C326&sr=8-1)

# Download and Work with the Data

## A. Get the dataset
1. Press "Connect" in the upper right corner of this page (on Colab).

2. The dataset is available through Kaggle.com, which requires a login to access. I have also made the dataset available on my GitHub page, and you should be able to access and import it using the commands in our first code cell.


- If the following command fails, go to the [GitHub page](https://github.com/ccwilliamsut/machine_learning/tree/master/absolute_beginners/data_files/modified), download "CaliforniaHousingDataModified.csv" to your downloads folder, and uncomment the alternative commands.

- If the above does not work and you are still having trouble, consult this [link](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92) to learn how to work with data files in Colaboratory.
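The fallback described above (try the URL first, then a downloaded copy) can also be sketched in code. This is a minimal sketch, not part of the original notebook: the helper name `load_csv_with_fallback` is hypothetical, and it assumes that both a failed download and a missing file surface as an `OSError`.

```python
import pandas as pd


def load_csv_with_fallback(primary, fallback):
    """Try `primary` (a URL or a path) first; if it cannot be read, use `fallback`.

    Hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(primary)
    except OSError as err:
        # urllib's URLError/HTTPError and FileNotFoundError are all OSError subclasses
        print(f"Could not read {primary!r} ({err}); trying {fallback!r}")
        return pd.read_csv(fallback)


# Example call (paths are the ones mentioned in this lesson):
# df = load_csv_with_fallback(url, '~/Downloads/CaliforniaHousingDataModified.csv')
```

On Colab the downloaded file would first need to be uploaded to the runtime, as the linked article explains.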

In [0]:
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SETUP ENVIRONMENT <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
from IPython.display import display
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


# --------------------------------------------------- ACQUIRE DATA ---------------------------------------------------
# Set the URL for the data file
url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

# Import the datafile from the provided url and run the cell
df = pd.read_csv(url)

# ------- ALTERNATIVE COMMANDS (if the above commands do not work) -------
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')

## B. Explore the Data
As you begin to look over your data, it is **important** to consider the following **key concepts**:
1. Does my data have **labels**? (dependent variables; determines whether we will use supervised or unsupervised algorithms)
  - If it does, the kind of label determines the type of model that you are able to create (regression for numeric labels, classification for categorical labels)
2. Are there major **issues** with the data?
  - This includes things like **null values, few samples, misspellings, derived data, unknown scales or values, etc.**
3. What is my **goal**? (For this course, our goal is to create a model that can predict housing prices based upon the given set of features.)
    - Will I be able to accomplish that goal with this dataset?
    - If not, can I employ feature engineering to create the data I need?
    - If so, which features will contribute to that goal and which are unnecessary?
4. Can I see any **relationships** in the data that might serve as a good foundation for my model?
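
The first question above (labels, and what kind) can be turned into a rough check. The sketch below is only a heuristic under stated assumptions: the function name `suggest_task` and the toy values are made up for illustration, and a numeric column that actually stores class codes would be misread as a regression target.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_task(frame, label=None):
    """Rough heuristic: map the presence and dtype of a label to a task type."""
    if label is None or label not in frame.columns:
        return "unsupervised"
    if is_numeric_dtype(frame[label]):
        return "supervised: regression"
    return "supervised: classification"


# Made-up rows that mimic a few of the housing features
toy = pd.DataFrame({
    "median_income": [2.5, 8.3, 4.1],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [120000.0, 450000.0, 210000.0],
})

print(suggest_task(toy, "median_house_value"))
print(suggest_task(toy, "ocean_proximity"))
print(suggest_task(toy))
```

For this course the label is the house value, which is numeric, so the goal of predicting prices is a regression problem.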


### i. Basic Code Functions for Analyzing Datasets
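The cells below walk through the basic pandas tools for a first look at a dataset: ```head()``` and ```sample()``` preview a few rows, ```describe()``` summarizes the numeric features, ```columns``` shows the names of our **features** (i.e. column headings), which ```list()``` prints in a neater form, and ```shape``` and ```info()``` report the size, non-null counts and data types.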

In [0]:
# Look at the first 5 rows of the file
print('The "head" of our data:')
df.head()

In [0]:
# Get a random sample of the data
print('\n\nRandom sample of the data:')
df.sample(5)

In [0]:
# Get a statistical description of the data
print('\n\nStatistical description of the data:')
df.describe()

In [0]:
# List the column headers to see what we are working with
print("\n\nHere are all of our columns:\n")
df.columns

In [0]:
# List all the columns in the dataset in an ordered way
print('\n\nAll of our columns in a list form (much easier to read):\n')
list(df.columns)

In [0]:
# Look at the "shape" of the dataset
print('\n\nLook at the shape of a dataset (rows / columns):\n')
df.shape

In [0]:
# Look at some information about the dataset (feature name, total non-null record count,
#   whether the feature is empty, datatype for each feature)
print('\n\nLook at the info about a dataset (feature name, record count, records exist, datatype):\n')
df.info()
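
Several of the calls above can be combined into one overview table. The following is a small sketch rather than part of the original lesson: `summarize` is a hypothetical helper, and the toy frame stands in for the real dataset.

```python
import numpy as np
import pandas as pd


def summarize(frame):
    """One row per feature: dtype, non-null count, missing count, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.count(),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })


# Made-up rows; the real call would be summarize(df)
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})
print(summarize(toy))
```

A table like this makes it easier to compare features side by side than reading the separate outputs one at a time.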

### ii. Look for potential problems

**Finding problems**
- Look at the above data closely. **Talk** about it with your **group/partner** or think about it on your own.
- Can you spot the potential issues that we might have with this dataset?

>**NOTE:** There are **at least 3 problems** that we can identify using the above information. 
- Can you find them all? 
- Can you see any more?
- Do you think that the above data will help our model to be more accurate?
- Should we keep all, some or just a few of these features?

We will not make any changes to our dataset just yet, but it is worth noting the potential problems that exist.

Let's next explore the data through **visualization** to get a better idea of any additional problems that might be present.

## C. Visualization of Data
Using a variety of graphs is a great way to explore data. 

> Websites referenced:
- [Overview of graph types and purposes](https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0)
- [Analyze the data through data visualization using Seaborn (Towards Data Science)](https://towardsdatascience.com/analyze-the-data-through-data-visualization-using-seaborn-255e1cd3948e)
- [Tutorial on visualizing distributions](https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/)
- [Creating multi-dimensional subplots](https://matplotlib.org/3.1.1/gallery/subplots_axes_and_figures/subplots_demo.html)
- [Great demo of the differences between histograms and KDE](https://mglerner.github.io/posts/histograms-and-kernel-density-estimation-kde-2.html)
- [Explanation of KDE](http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/)
- [Stackoverflow Help](https://stackoverflow.com/questions/25212986/how-to-set-some-xlim-and-ylim-in-seaborn-lmplot-facetgrid)
- [Seaborn Documentation: Pairplots](https://seaborn.pydata.org/generated/seaborn.pairplot.html)
- [Short Tutorial: Pairplots](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
- [Pairplot Code Examples](https://jovianlin.io/data-visualization-seaborn-part-2/)
- [Stackoverflow: Limiting x- and y-axis values](https://stackoverflow.com/questions/54951362/seaborn-jointplot-with-defined-axes-limits)
- [Seaborn Documentation: Heatmaps](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- [Excellent tutorial on modifying heatmaps](https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07)
- [Short Tutorial about correlation and heatmaps](https://jovianlin.io/data-visualization-seaborn-part-2/)

### i. Histograms

In [0]:
# Use histograms to explore features in the dataset

# Create a "figure" that can hold multiple plots
fig, (ax0, ax1, ax2) = plt.subplots(nrows = 1,              # create a grid with a specified number of rows
                                    ncols = 3,              # specify the number of columns
                                    figsize = (15, 4),      # specify the size of the whole figure (width, height)
                                    sharey = False          # specify if all plots will share the y-axis values
                                    )

# ------------------ Histograms ---------------------
# Note: sns.distplot() is deprecated (since seaborn 0.11); sns.histplot() draws the same histograms
# Histogram 1 (in slot 'ax0', the first container)
sns.histplot(df['housing_median_age'],
             bins = 20,
             color = 'magenta',
             ax = ax0
             )

# Histogram 2 (in slot 'ax1', the second container)
sns.histplot(df['t_rooms'],
             bins = 30,
             color = 'green',
             ax = ax1
             )

# Histogram 3 (in slot 'ax2', the third container)
sns.histplot(df['median_house_value'],
             bins = 50,
             color = 'blue',
             ax = ax2
             )
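
To see what the `bins` argument in the histograms does, it can help to compute the counts directly. This is a minimal sketch with made-up ages, assuming equal-width bins between the smallest and largest value (which is how NumPy treats an integer `bins`).

```python
import numpy as np

# Made-up housing ages, not taken from the dataset
ages = np.array([5, 12, 18, 18, 25, 31, 33, 40, 47, 52])

# Split the range [min, max] into 5 equal-width bins and count the values in each
counts, edges = np.histogram(ages, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} records")
```

More bins give a finer picture but noisier bars; fewer bins smooth the shape but can hide features such as a spike at the maximum value.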



### ii. Scatterplots

In [0]:
# Use scatterplots to compare features in the dataset

# Create a "figure" that can hold multiple plots (in this case, 1 row and 3 columns)
fig, ((ax1, ax2, ax3)) = plt.subplots(nrows = 1,                      # create a grid with a specified number of rows
                                      ncols = 3,                      # specify the number of columns
                                      figsize = (20, 5),              # specify the size of the whole figure (width, height)
                                      sharey = False,                 # specify if all plots will share the y-axis values
                                      sharex = False                  # specify if all plots will share the x-axis values
                                      )

# Plot three scatterplots (each with a fitted regression line) side by side
# Scatterplots 1 to 3
ax1 = sns.regplot(x = df['median_income'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax1, 
                  dropna = True,
                  order = 1,                                          # order = 1 fits a straight line; a value above 1 fits a polynomial
                  line_kws = {'color': 'darkorange'},
                  fit_reg = True
                  )

ax2 = sns.regplot(x = df['total_bedrooms'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax2, 
                  dropna = True,
                  order = 1,
                  line_kws = {'color': 'purple'},
                  fit_reg = True
                  )

ax3 = sns.regplot(x = df['population'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax3, 
                  dropna = True,
                  order = 1,
                  line_kws = {'color': 'blue'},
                  fit_reg = True
                  )


# Change the range of a few axes variables to make the graphs more useful
#ax1.set(ylim = (0, 600000))
#ax3.set(ylim = (0, 600000))
#ax3.set(xlim = (0, 15000))
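
The `order` argument in the scatterplots sets the degree of the fitted line. As a rough illustration of what that fit computes, the sketch below uses `numpy.polyfit` on made-up income and value pairs; it is a standalone approximation of the idea, not a reproduction of seaborn's internals.

```python
import numpy as np

# Made-up (median_income, median_house_value) pairs
income = np.array([1.5, 2.0, 3.5, 4.0, 5.5, 7.0])
value = np.array([90000, 110000, 170000, 185000, 250000, 310000])

# Degree 1: a straight line, value ~ slope * income + intercept
slope, intercept = np.polyfit(income, value, deg=1)
print(f"Each extra unit of income adds about {slope:,.0f} to the predicted value")

# Degree 2: a curve, value ~ a * income**2 + b * income + c
a, b, c = np.polyfit(income, value, deg=2)

# Predict the value at an income of 5.0 with the straight-line fit
print(f"Predicted value at income 5.0: {slope * 5.0 + intercept:,.0f}")
```

A higher degree always follows the sample points more closely, which is not the same as describing the underlying relationship better.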

### iii. Countplots


In [0]:
# Create a countplot with a categorical variable
sns.countplot(x = df['ocean_proximity'])
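
The numbers behind a countplot are available directly from pandas. A small sketch with made-up category values (the real call would use `df['ocean_proximity']`):

```python
import pandas as pd

# Made-up category values for illustration
proximity = pd.Series(
    ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR BAY"],
    name="ocean_proximity",
)

counts = proximity.value_counts()                # records per category
shares = proximity.value_counts(normalize=True)  # fraction of all records

print(counts)
print(shares.round(2))
```

Categories with very few records are worth noting, since a model has little to learn from them.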

### iv. Categorical Histograms (using FacetGrid)
**FacetGrid plots** allow one to break out data by category while analyzing either univariate or bivariate data.

In [0]:
# Analyze graphs side-by-side (using seaborn FacetGrid)
ghist = sns.FacetGrid(df, 
                      col='ocean_proximity', 
                      hue = 'ocean_proximity', 
                      dropna = True, 
                      legend_out=True, 
                      despine=True
                      )
ghist.map(plt.hist,
          'median_house_value',
          alpha=1,
          bins = 20
          )

# Create another facetgrid with scatterplots
gscat = sns.FacetGrid(df,
                      col='ocean_proximity',
                      hue = 'ocean_proximity',
                      dropna = True,
                      despine=True
                      )
gscat.map(plt.scatter,
          'median_income',
          'median_house_value',
          alpha=.3
          )
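
The per-category view that the grid draws can be complemented with a per-category table. This is a sketch on made-up rows; with the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000, 140000, 300000, 340000, 500000],
})

# One row per category, one column per statistic
by_area = toy.groupby("ocean_proximity")["median_house_value"].agg(
    ["count", "median", "max"]
)
print(by_area)
```

Reading the table next to the plots helps confirm whether an apparent difference between categories is backed by enough records.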

### v. Pairplots
With this simple function, we can quickly analyze most of our features against one another to:
- Seek out possible relationships
- Identify possible issues with data

***Websites referenced:***
- [Seaborn Documentation: Pairplots](https://seaborn.pydata.org/generated/seaborn.pairplot.html)
- [Short Tutorial](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
- [Pairplot Code Examples](https://jovianlin.io/data-visualization-seaborn-part-2/)

In [0]:
# Use seaborn.pairplot() to quickly look at possible relationships and visually analyze the data

# Use this to analyze each feature against the others while also segmenting out the proximity to the ocean
#sns.pairplot(data = df,
#              hue='ocean_proximity', 
#              dropna=True, 
#             )

# Use this function to analyze each feature against the others (add kind = 'reg' to draw regression lines that help identify relationships)
sns.pairplot(data = df, 
             dropna = True
             )

### vi. Correlation Heatmaps
We can also look at how strongly features are correlated via the **```seaborn.heatmap()```** function. This gives us an idea of which features are well or poorly correlated with each other.

> *Websites referenced:*
- [Seaborn Documentation: Heatmaps](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- [Excellent tutorial on modifying heatmaps](https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07)
- [Short Tutorial about correlation and heatmaps](https://jovianlin.io/data-visualization-seaborn-part-2/)

In [0]:
# -- Analyze the data with a heatmap --
# 1. Only keep the columns and data that we want to see correlated
cols = df.drop(['lattitude', 'longitude', 'ocean_proximity'],
               axis=1
               )

# 2. Fill in missing values with the median value of the feature
#     - You can also use the mean (or even the mode if dealing with categorical data)
cols.fillna(cols.median(),
            inplace = True
            )

# 3. Calculate the correlations
corr = cols.corr()

# 4. Set the size you want
plt.figure(figsize=(9, 9))

# 5. Display the heatmap
sns.heatmap(corr,
            annot=True,
            vmin = -1,
            vmax = 1,
            center = 0
            )
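
Since our goal is to predict house values, one column of the correlation matrix matters most. The sketch below ranks features by the strength of their correlation with the target; the toy values are made up, and with the real data `toy` would be replaced by the `cols` frame from the cell above.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.5, 4.0, 5.5, 7.0],
    "population": [900, 300, 1200, 500, 800, 600],
    "median_house_value": [90000, 110000, 170000, 185000, 250000, 310000],
})

target = "median_house_value"

# Take the target's column from the correlation matrix, drop its
# self-correlation, and sort by absolute strength (strongest first)
ranked = toy.corr()[target].drop(target).sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain sort would push them to the bottom.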

## D. Check for Missing Data
One thing that we need to do before moving on is to see if there is any missing data. If a value is null when the data is imported to Pandas, then the value "NaN" is assigned. Some of our data might have null (```NaN```) values that will cause problems with the performance of our model. 

Missing data can cause a number of problems:
- Statistical measures and graphs built on incomplete data can **distort our understanding**
- Missing data can **distort our model** if those values affect predictive or clustering calculations
- Some **functions can fail** if they encounter null values
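
A short sketch of how these problems show up in practice, using made-up bedroom counts. It relies on two standard behaviors: pandas skips `NaN` by default when aggregating, while plain NumPy functions propagate it.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
bedrooms = pd.Series([129.0, np.nan, 190.0, 235.0])

# pandas silently skips the NaN, so this is the mean of 3 values, not 4
pandas_mean = bedrooms.mean()

# NumPy propagates the NaN, so the result itself is NaN
numpy_mean = np.mean(bedrooms.to_numpy())

# NumPy's NaN-aware variant matches the pandas behavior
safe_mean = np.nanmean(bedrooms.to_numpy())

print(pandas_mean, numpy_mean, safe_mean)
```

Neither behavior is wrong, but both can mislead if we do not know that values are missing in the first place.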

Let's quickly look over our dataset for any missing values.

>Websites Referenced:
- [Kaggle.com tutorial on the usage of 'msno'](https://www.kaggle.com/residentmario/using-missingno-to-diagnose-data-sparsity)

In [0]:
# Use the MissingNo library to visualize where any missing data is
msno.matrix(df, figsize = (10, 5), sparkline = False)

It appears that our **```total_bedrooms```** feature has some missing data. We can investigate further to count the number of records with missing values.

In [0]:
# Get a count of "NaN" or missing values by feature
print('Count of missing values by feature:\n')
display(df.isnull().sum(axis=0))
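
Raw counts are easier to judge as percentages, and it is often useful to look at the affected rows themselves. A minimal sketch on made-up rows (the real calls would use `df`):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "population": [322, 2401, 496, 558],
})

# Percentage of missing values per feature
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# The rows that contain at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```

A feature that is missing in a small share of rows is usually a candidate for filling in; one that is mostly empty may not be worth keeping.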

## E. Summarizing Problems
The primary problems at this point are:
- The spelling of **```lattitude```**
- The meaning of **```t_rooms```**
- The scale used for **```median_income```**
- The scale and distribution used for **```proximity_to_store```**
- **Missing data** in the **```total_bedrooms```** feature
- **Outlier data** in a number of features
- The value limit of $500,000 for **```median_house_value```**
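
Two of the problems above, the value limit and the outliers, can also be checked numerically instead of only by eye. The sketch below uses made-up values; the helper names are hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd


def share_at_max(series):
    """Fraction of records equal to the maximum; a large share hints at a capped value."""
    return (series == series.max()).mean()


def iqr_outlier_count(series, k=1.5):
    """Count values more than k * IQR below the first or above the third quartile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())


# Made-up values for illustration
values = pd.Series([120000, 180000, 250000, 500000, 500000, 500000, 310000, 90000])
population = pd.Series([800, 950, 1100, 1200, 1300, 1000, 900, 15000])

print(share_at_max(values))
print(iqr_outlier_count(population))
```

In a naturally distributed feature only a tiny share of records should sit exactly at the maximum, so a large share suggests that higher values were cut off at a limit.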


#### Analyzing the problems
- **Spelling** might not seem important at first (after all, it doesn't directly affect our model's performance), but it is actually a fairly big issue. This is not because of any immediate coding problems, as we could easily write our code to accommodate the misspelling. But when our model moves downstream into other workflows, we will have to point out the misspelling to others, lest we make their debugging more difficult and time-consuming. Worse yet, it might not get caught at all and be shown to leadership as is, possibly leaving a bad impression. **If we think that we might use this feature, then it is better to fix it now than to wait until later**.

- **```t_rooms```** presents another issue, as we cannot be sure exactly what it means at this point. We can guess based on the other data that it likely means "total rooms", but we need to be sure. To get the answer, we (fictionally, in this case) contact the team that collected the data and confirm that **```t_rooms```** should be **```total_rooms```**. Again, we want to think about whether we actually need this feature before making the change. It seems likely that the total rooms in a house will have some kind of correlation with the overall price, so let's keep this feature and rename it.

- **```median_income```** and **```proximity_to_store```** present different problems. We do not know the scale being used in either case, and they appear to be different from one another. Additionally, the distribution for proximity to store appears to be uniform, meaning that it might not be adding anything to our model.

- **Missing Data** can cause a number of problems in machine learning, but we will learn how to fix it in the following section (**Scrubbing Data**).
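
As a preview of the two naming fixes discussed above, here is one possible way to rename columns in pandas. It is only a sketch on a made-up row, and the next lesson may handle the renaming differently.

```python
import pandas as pd

# Made-up row that uses the column names as they appear in the dataset
toy = pd.DataFrame({"lattitude": [37.88], "longitude": [-122.23], "t_rooms": [880]})

# rename() returns a new frame; the original keeps its old column names
renamed = toy.rename(columns={"lattitude": "latitude", "t_rooms": "total_rooms"})

print(list(renamed.columns))
```

Because `rename()` returns a new frame by default, the result has to be assigned to a variable for the change to persist.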

# Takeaways
In this lesson, we have learned how to:
- Set up our environment
- Import data
- Explore data for problems and viability
- Visualize data to identify additional issues and relationships
- Check for missing data

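As you explore your dataset, it is important to keep the key concepts from Section B in mind:
- Whether the data has **labels**
- Any major **issues** with the data
- Your **goal**, and whether the dataset can support it
- Any **relationships** that might serve as a foundation for your model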


### Moving Forward:
We will work through all of the problems listed here (and more!) in our next lesson: **Scrubbing Data**. 