# Introduction to Data Science

## What Libraries?

#### pandas is a Python package for handling large amounts of tabular data. It helps to sort, filter, and clean that data.


> From pandas documentation: https://pypi.org/project/pandas/

##### Some highlights:

- Easy handling of **missing data** (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be **inserted and deleted** from DataFrame and higher dimensional objects
- Intelligent **label-based slicing**, **fancy indexing**, and **subsetting** of large data sets
- Intuitive **merging** and **joining** data sets
- Flexible **reshaping** and **pivoting** of data sets
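
Two of these highlights, missing-data handling and size mutability, can be sketched on a tiny table. The values and the `rooms_filled` column name below are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
toy = pd.DataFrame({"rooms": [3, np.nan, 5], "city": ["Ames", None, "Boone"]})

# Missing values are detected in numeric and non-numeric columns alike
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Size mutability: insert a new column, then delete an existing one
toy["rooms_filled"] = toy["rooms"].fillna(toy["rooms"].mean())
toy = toy.drop(columns=["city"])
print(toy)
```

Filling with the column mean is only one possible choice; the right strategy depends on the data.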


#### NumPy is a package that defines a multi-dimensional array object.

> From NumPy documentation: https://pypi.org/project/numpy/
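
As a minimal sketch of what "multi-dimensional array" means in practice, the snippet below builds a small two-dimensional array with arbitrary values and inspects its dimensions.

```python
import numpy as np

# A 2 x 3 array: two rows, three columns (values are arbitrary)
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.ndim)   # number of dimensions
print(grid.shape)  # size along each dimension

# Summing along axis 0 collapses the rows, giving one total per column
column_sums = grid.sum(axis=0)
print(column_sums)
```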


#### matplotlib produces quality 2D graphics

> From matplotlib documentation: https://pypi.org/project/matplotlib/

#### seaborn is a library for making statistical graphics

> From seaborn documentation: https://pypi.org/project/seaborn/

Let's get started with some sample data.

In [None]:
import pandas as pd # the pd is by convention
import numpy as np # as is the np

import matplotlib.pyplot as plt
import seaborn as sns

# To Plot matplotlib figures inline on the notebook
%matplotlib inline

A DataFrame is essentially a table: rows of observations and labeled columns.
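
To make the table idea concrete, here is a small hand-built DataFrame. The column names mirror ones used later in this notebook, but the values and the `Id` index are invented for illustration.

```python
import pandas as pd

# Hypothetical values for illustration only
houses = pd.DataFrame(
    {"YearBuilt": [1995, 2004, 1973], "SalePrice": [150000, 210000, 98000]},
    index=pd.Index([1, 2, 3], name="Id"),
)

print(houses)         # rows are observations, columns are variables
print(houses.shape)   # (number of rows, number of columns)
print(houses.dtypes)  # each column has its own data type
```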

### Reading Data

Data Description: 

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

Data obtained from Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data


##### To begin with, we're going to read in some data from a CSV

In [None]:
import pandas as pd
url ='https://raw.githubusercontent.com/ahan02/Data-Science-Hackathon-Workshop-at-UKC-2019/master/data/train.csv'
sample = pd.read_csv(url,index_col=0)
sample.head()

In [None]:
sample.tail()

Let's use some of pandas' built-in methods to learn more about our data.

In [None]:
sample.info()

In [None]:
sample.describe()

In [None]:
sample.shape

Great, looks like these are all behaving relatively as expected! That's lovely. 

### Getting data from the dataframe

Now let's learn how to grab some data from the DataFrame. 

In [None]:
sample['YearBuilt'].head()

In [None]:
sample['SalePrice']

In [None]:
sample['SalePrice'].value_counts()

**Exercise 1:**

Using the DataFrame, explore the distribution of the sale price by sale condition and by number of bedrooms above ground.

In [None]:
sample.columns

In [None]:
sample.groupby(['SalePrice', 'SaleCondition']).size()

In [None]:
g = sample.groupby(['SalePrice', 'BedroomAbvGr'])
size = g.size()
size[size > 5]
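
Another way to approach this exercise, offered as a sketch rather than the intended solution, is to group by the category and summarize the price within each group. The toy values below are invented; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the housing data
toy = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Abnorml", "Normal", "Abnorml"],
    "BedroomAbvGr": [3, 3, 2, 4, 2],
    "SalePrice": [200000, 180000, 120000, 250000, 100000],
})

# Summarize the price distribution within each sale condition
by_condition = toy.groupby("SaleCondition")["SalePrice"].agg(["count", "mean", "min", "max"])
print(by_condition)

# Median price for each bedroom count
by_bedrooms = toy.groupby("BedroomAbvGr")["SalePrice"].median()
print(by_bedrooms)
```

Grouping by the category and aggregating the price keeps one row per category, which is usually easier to read than counting every distinct price.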

In [None]:
# Subsetting
newsample=sample[['SalePrice', 'BedroomAbvGr']]
newsample

We'll use `.idxmin` to find the row with the fewest bedrooms. It returns the index label of the first occurrence of the minimum value, which `.loc` then looks up.

In [None]:
newsample.loc[newsample['BedroomAbvGr'].idxmin()]
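
Because `.idxmin` reports only the first occurrence, ties are silently ignored. The sketch below uses invented values and index labels to show the difference between the first minimum and all rows that share the minimum.

```python
import pandas as pd

# Hypothetical toy data; index labels 101..104 are made up
toy = pd.DataFrame(
    {"SalePrice": [200000, 120000, 95000, 130000], "BedroomAbvGr": [3, 1, 2, 1]},
    index=[101, 102, 103, 104],
)

# idxmin returns an index label, not a row position
first_min_label = toy["BedroomAbvGr"].idxmin()
row = toy.loc[first_min_label]
print(first_min_label)
print(row)

# To see every row tied for the minimum, compare against the minimum value
all_min_rows = toy[toy["BedroomAbvGr"] == toy["BedroomAbvGr"].min()]
print(all_min_rows)
```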

### Row Lookups

We'll use `.iloc` to do the job. Let's demonstrate by grabbing the first (0th) row.

In [None]:
newsample.iloc[0]

In [None]:
sample.iloc[0]

Return multiple rows by following Python's slicing conventions, like so:

In [None]:
newsample.iloc[0:3]

In [None]:
sample.iloc[0:3]

Note: `iloc` stands for "integer location"; it selects rows by position rather than by index label.
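
The difference between position-based and label-based lookup is easiest to see when the index labels are not 0, 1, 2. The sketch below uses invented labels and prices.

```python
import pandas as pd

# Hypothetical toy data with index labels 10, 20, 30
toy = pd.DataFrame({"SalePrice": [200000, 120000, 95000]}, index=[10, 20, 30])

by_position = toy.iloc[0]  # first row, whatever its label is
by_label = toy.loc[10]     # the row whose index label is 10

first_two = toy.iloc[0:2]    # positional slice: end position is excluded
through_20 = toy.loc[10:20]  # label slice: end label is included

print(first_two)
print(through_20)
```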

### Filtering

Now we want to filter the data so we only see rows that match certain criteria. A comparison produces a boolean Series: `True` where the condition is met and `False` where it is not.

In [None]:
newerhomes = (sample['YearBuilt'] > 2000) 
newerhomes

In [None]:
sample.columns

In [None]:
largerhomes = (sample['BedroomAbvGr'] > 4) 
largerhomes

Now we apply `largerhomes` as a mask and should see only the rows that meet that condition!

In [None]:
sample_test = sample[largerhomes] 
sample_test
#sample_test.head()

Multiple conditions? No problem: combine them with `&`, wrapping each condition in parentheses.

In [None]:
test = (sample['BedroomAbvGr'] > 4) & (sample['YearBuilt'] ==1973)
sample[test]
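
Boolean masks can be combined with `&` (and), `|` (or), and `~` (not). The following sketch uses invented values to show all three; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data using two column names from the housing data
toy = pd.DataFrame({"BedroomAbvGr": [5, 3, 6, 2], "YearBuilt": [1973, 2005, 1990, 1973]})

large = toy["BedroomAbvGr"] > 4
built_1973 = toy["YearBuilt"] == 1973

both = toy[large & built_1973]    # rows meeting both conditions
either = toy[large | built_1973]  # rows meeting at least one condition
not_large = toy[~large]           # rows where the first condition is False

print(both)
print(either)
print(not_large)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators are used instead.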

**Exercise 2:**

Subset the sale price and the number of full baths, find the lowest sale price, and return the number of full baths for that house.

In [None]:
# Hint: create a filter called mask, then apply it to the dataframe

In [None]:
# Subsetting
newsample=sample[['SalePrice', 'FullBath']]
newsample

In [None]:
newsample.loc[newsample['SalePrice'].idxmin()]

### Doing Stats with Pandas

Let's do some summary statistics

In [None]:
sample['SalePrice'].mean()

Or we could find the max or min 

In [None]:
print(sample['SalePrice'].max())

In [None]:
print(sample['SalePrice'].min())
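
The separate calls above can also be gathered in one step with `Series.agg`. This is a small sketch on made-up prices, not output from the housing data.

```python
import pandas as pd

# Hypothetical prices for illustration
prices = pd.Series([100000, 150000, 200000, 250000], name="SalePrice")

# Compute several summary statistics in a single call
summary = prices.agg(["min", "max", "mean", "median"])
print(summary)
```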

### Making new columns

Pandas also allows us to create columns that are combinations of other columns. Let's make a column that holds the number of years between when a house was built and when it was last remodeled.

In [None]:
sample.head(3)

In [None]:
# keep only rows 1 through 5 (positional slice; all other rows are dropped)
newdf=sample[1:6]
newdf

In [None]:
#drop columns
newdf2=sample.drop(columns=["Street"])
newdf2

In [None]:
sample['Compare_years'] = sample['YearRemodAdd']-sample['YearBuilt']
sample
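
Assigning to `sample[...]` changes the DataFrame in place. As an alternative sketch, `DataFrame.assign` returns a new DataFrame and leaves the original untouched. The values below are invented, and the `Remodeled` column is a hypothetical addition for illustration.

```python
import pandas as pd

# Hypothetical toy data using two column names from the housing data
toy = pd.DataFrame({"YearBuilt": [1970, 2000, 2005], "YearRemodAdd": [1995, 2000, 2010]})

# assign() returns a copy with the new column; toy itself is not modified
with_gap = toy.assign(Compare_years=toy["YearRemodAdd"] - toy["YearBuilt"])

# A derived column can feed another one (Remodeled is a made-up name)
with_gap["Remodeled"] = with_gap["Compare_years"] > 0
print(with_gap)
```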

In [None]:
##check nulls in the data

sample.isnull().sum()

In [None]:
#map missing values

sns.heatmap(sample.isnull(), cbar=False)
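
Once missing values are located, two common next steps are dropping incomplete rows or filling the gaps. The sketch below shows both on invented values; filling with the median is an assumption made for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "SalePrice": [200000, 180000, 120000, 250000],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column
print(null_counts)
print(null_share)

dropped = toy.dropna()  # keep only rows with no missing values
filled = toy.fillna({"LotFrontage": toy["LotFrontage"].median()})  # fill gaps with the median
print(filled)
```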

### Plotting with Pandas

The last part of pandas we want to explore today is some of its built-in plotting features. Let's plot a histogram.

In [None]:
sample['SalePrice'].plot.hist()
plt.xlabel("Sales price");

**Exercise 3**

Plot a histogram of another column, e.g. 'YearBuilt'

In [None]:
# Hint: plot.hist(), plot.line()

In [None]:
sample['YearBuilt'].plot.hist()
plt.xlabel("YearBuilt");

In [None]:
new_sample=sample[['SalePrice','YearBuilt', 'FullBath', 'BedroomAbvGr']]

In [None]:
new_sample

### Plotting with matplotlib

In [None]:
x = new_sample['FullBath']
y = new_sample['SalePrice']

plt.scatter(x, y,  alpha=0.5)
plt.show()

## Correlation Analysis

In [None]:
new_sample.corr()
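
By default `.corr()` computes the Pearson correlation coefficient for each pair of columns. As a sketch of what that number is, the snippet below computes it by hand for two columns of invented data and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using two column names from the housing data
toy = pd.DataFrame({"FullBath": [1, 2, 2, 3], "SalePrice": [100000, 150000, 170000, 240000]})

x = toy["FullBath"]
y = toy["SalePrice"]

# Pearson r: sum of co-deviations divided by the product of the deviation norms
co_deviation = ((x - x.mean()) * (y - y.mean())).sum()
r_manual = co_deviation / np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

r_pandas = toy.corr().loc["FullBath", "SalePrice"]
print(r_manual, r_pandas)
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.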

### Correlation matrix with seaborn

In [None]:
Var_Corr = new_sample.corr()
sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns, annot=True)

In [None]:
def heatMap(df):
    # Correlate the columns of the DataFrame passed in, not a global variable
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(10, 10))
    colormap = sns.diverging_palette(220, 10, as_cmap=True)
    # seaborn labels the ticks from the index and columns of corr
    sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f", ax=ax)
    plt.show()

In [None]:
heatMap(new_sample)

In [None]:
pd.plotting.scatter_matrix(new_sample, figsize=(10, 10))

## Challenges

The data folder contains the CSV file `Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv`. The data comes from: https://catalog.data.gov/dataset/census-data-selected-socioeconomic-indicators-in-chicago-2008-2012-36e55

This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” by Chicago community area, for the years 2008 – 2012. The indicators are the percent of occupied housing units with more than one person per room (i.e., crowded housing); the percent of households living below the federal poverty level; the percent of persons in the labor force over the age of 16 years that are unemployed; the percent of persons over the age of 25 years without a high school diploma; the percent of the population under 18 or over 64 years of age (i.e., dependency); and per capita income. 

### Challenge 1: Load the file into a Pandas dataframe, then print the top 5 rows

Store this dataframe in a variable called `poverty`

### Challenge 2: Find the mean 'PER CAPITA INCOME' and the mean 'PERCENT AGED 16+ UNEMPLOYED' in the dataset as a whole. 

### Challenge 3: Find the max, min, and mean of 'HARDSHIP INDEX', and return the 'COMMUNITY AREA NAME' of the areas with the max and min

### Challenge 4: Plot a histogram for 'HARDSHIP INDEX'

### Challenge 5: Create new column 'New HARDSHIP INDEX' by calculating 'HARDSHIP INDEX'/Mean of 'HARDSHIP INDEX'

### Challenge 6: Identify Null values in the data 

### Challenge 7: Subset the data into a smaller dataset. Let's choose 'COMMUNITY AREA NAME', 'PERCENT HOUSEHOLDS BELOW POVERTY', 'PERCENT AGED 16+ UNEMPLOYED', 'PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA', 'PER CAPITA INCOME', and 'HARDSHIP INDEX'

### Challenge 8: Using the new subset from Challenge 7, create another subset without 'COMMUNITY AREA NAME' (hint: delete the column)

### Challenge 9: Using the new subset from Challenge 8, draw a scatter plot of 'HARDSHIP INDEX' (Y) against 'PER CAPITA INCOME' (X)

### Challenge 10: Using the new subset from Challenge 8, compute a table of correlation coefficients

### Challenge 11: Using the new subset from Challenge 8, construct a correlation plot (heatmap) using seaborn



## Linear Regression Example


In [None]:
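# sub_poverty is assumed to be the numeric subset created in Challenge 8
import statsmodels.api as sm
indvar = sub_poverty[['PER CAPITA INCOME']]  # independent variable
dvar = sub_poverty[['HARDSHIP INDEX']]       # dependent variable
X = indvar
Y = dvar
X = sm.add_constant(X)  # add an intercept column to the predictors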

In [None]:
model = sm.OLS(Y, X, missing='drop').fit()
model.summary()
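
To see what the fit above estimates, the sketch below solves the same kind of least-squares problem with NumPy instead of statsmodels. This is a swapped-in technique for illustration only, and the data is invented so that it lies exactly on the line y = 50 - 2x.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 50 - 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 50.0 - 2.0 * x

# A column of ones plays the same role as sm.add_constant: it lets the fit include an intercept
X = np.column_stack([np.ones_like(x), x])

# Least squares finds the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(intercept, slope)
```

With real data the points will not fall exactly on a line, and `model.summary()` additionally reports standard errors and goodness-of-fit measures that this sketch does not compute.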