# Introduction to Data Science, Lab 5 (10/14)
- Exploratory Data Analysis (EDA)
- Data profiling (manual and with ```pandas_profiling```)

The material in this notebook is based on the Responsible Data Science course taught by Julia Stoyanovich in Spring 2020.

### *EDA and Data Profiling*

Exploratory Data Analysis (EDA) refers to a set of systematic methods for analyzing data. The term was coined in 1961 by John Tukey in an attempt to shift emphasis from statistical hypothesis testing to selecting useful data-inspired hypotheses to test with appropriate statistical tools.

Data profiling is a subset of EDA that focuses on descriptive statistics of the dataset and on assessing data quality before performing more sophisticated EDA. In this sense, data profiling can be viewed as a "pre-processing" stage of EDA. Intra-column data profiling includes specifying the ```length```, ```type```, ```uniqueness```, and ```missingness``` of values. For numeric features, it also includes the ```minimum```, ```maximum```, ```mean```, ```mode```, ```variation```, and ```quantiles``` (for example, summarized in a box-plot).
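As a rough illustration of the intra-column statistics listed above, the sketch below profiles each column of a small DataFrame. The toy data and the helper name `profile_column` are assumptions made up for this example; they are not part of the lab's datasets.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the statistics named above.
toy = pd.DataFrame({
    "city": ["Van", "NYC", None, "Van"],
    "price": [10.0, 12.5, 11.0, None],
})

def profile_column(series):
    """Return a few intra-column profiling statistics for one column."""
    non_null = series.dropna()
    stats = {
        "type": str(series.dtype),
        "missing": int(series.isnull().sum()),
        "unique": int(non_null.nunique()),
        # Length of the longest value when rendered as a string.
        "max_length": int(non_null.astype(str).str.len().max()),
    }
    # Numeric-only statistics are added when they make sense.
    if pd.api.types.is_numeric_dtype(series):
        stats.update({
            "min": non_null.min(),
            "max": non_null.max(),
            "mean": non_null.mean(),
            "median": non_null.median(),
        })
    return stats

profile = {col: profile_column(toy[col]) for col in toy.columns}
print(pd.DataFrame(profile))
```

Statistics that do not apply to a column (for example, the mean of a text column) simply show up as missing in the printed table.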

### *Manual Profiling*

In [None]:
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
iris=load_iris()
data=pd.DataFrame(data=np.c_[iris['data'],iris['target']],columns=iris['feature_names']+['target'])
data.head(5)

In [None]:
# Understand dimensions:
print(data.shape)

In [None]:
# Summary of target:
data['target'].value_counts()

In [None]:
# View summary as histogram:
data['target'].value_counts().plot(kind='bar')
plt.xticks(rotation=25)
plt.show()

In [None]:
# Summary of "petal width (cm)":
data['petal width (cm)'].value_counts()

In [None]:
# Box-plot a numeric column:
data.boxplot(column='petal width (cm)')

In [None]:
# Box-plot multiple numeric columns:
data.boxplot(column=['petal width (cm)','sepal length (cm)','petal length (cm)'])

In [None]:
# A pandas built-in for descriptive statistics:
data.describe()

In [None]:
# What if we have non-numeric types?
data['text']=['lorem ipsum']*len(data)
data.head(5)

In [None]:
# describe() summarizes only numeric columns by default, so 'text' is left out:
display(data.describe())
data.drop("text",axis=1,inplace=True)
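If a summary of non-numeric columns is wanted, `describe` accepts an `include` argument. The sketch below uses a small made-up DataFrame (the `width` and `text` columns are assumptions for illustration) to contrast the default with `include="all"`.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy = pd.DataFrame({"width": [0.2, 0.4, 1.3], "text": ["lorem ipsum"] * 3})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # adds count/unique/top/freq for text

print(numeric_only.columns.tolist())
print(everything.loc[["count", "unique", "top", "freq"], "text"])
```

For text columns the report contains the number of values, the number of distinct values, the most frequent value, and its frequency; the numeric rows are left empty for them.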

In [None]:
# We've already seen these:
display(data.head(5))
display(data.tail(5))

In [None]:
# We can also sample some rows randomly: 
data.sample(5)

In [None]:
# We saw this in Lab 1
print(data.columns)
print(data.index) 

In [None]:
# Rename columns:
new_cols=["sepal_length","sepal_width","petal_length","petal_width","class"]
data=data.rename(columns=dict(zip(data.columns,new_cols)))
data.columns

In [None]:
# Pandas query:
data.query('petal_length*1.1>sepal_length')

In [None]:
# An alternative method from Lab 2:
data[1.1*data['petal_length']>data['sepal_length']]

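Both cells apply the same filter and return the same rows: the flowers whose petal length, scaled by 1.1, exceeds their sepal length. Without the scaling factor, the hypothesis ```petal_length > sepal_length``` doesn't hold for any row, and you get an empty DataFrame back as a result.

Note that the same filter can also be written with attribute access: ```data[1.1*data.petal_length>data.sepal_length]```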

In [None]:
# Check for missing values:
data.isnull().any()

#### *Handling missing data*
Dealing with missing data is, of course, data- and domain-specific. However, there are two common approaches:
- *Deletion:* sometimes missing values do not carry any significance; e.g., the dataset we downloaded from the internet had compatibility issues when saved as ```.csv``` in Excel, producing some extraneous rows with *NaNs*. Deleting rows or even columns with missing data in such cases is safe;
- *Imputation:* missing values sometimes do carry information, and discarding them might bias the analysis (can you think of examples?). In such cases, missing data is often substituted with a column mean, mode, or median. In more sophisticated analyses, a missing value can be substituted with an estimated value (e.g., predicted with k-NN based on other features). Additionally, it is sometimes effective to add a binary feature to the data indicating which values from a particular column were missing, so that the fact of missingness is not lost after imputation.
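The two approaches above can be sketched on a toy table. The data and column names (`rooms`, `sq_ft`) are hypothetical, and the k-NN step uses scikit-learn's `KNNImputer` with a single neighbor purely to keep the example easy to follow; it is one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "rooms": [2.0, 3.0, 3.0, 4.0, 2.0],
    "sq_ft": [800.0, np.nan, 1200.0, 1600.0, np.nan],
})

# Deletion: drop every row that contains a missing value.
deleted = toy.dropna()

# Imputation with a column statistic, plus a binary "was missing" indicator
# recorded before the gaps are filled.
imputed = toy.copy()
imputed["sq_ft_missing"] = imputed["sq_ft"].isnull()
imputed["sq_ft"] = imputed["sq_ft"].fillna(imputed["sq_ft"].median())

# Imputation with k-NN: each gap is filled from the most similar row
# (measured on the features that are observed) that has a value for it.
knn = KNNImputer(n_neighbors=1)
knn_imputed = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns)

print(deleted.shape, imputed, knn_imputed, sep="\n\n")
```

Note how the median fills both gaps with the same value, while k-NN fills each gap from a row with the same number of rooms.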

In [None]:
data=pd.read_csv('property data.csv')
print(data.shape)
data

In [None]:
# Note that .isnull() only detects NaN/None, not placeholder strings such as "na" or "--":
data.isnull().sum()

In [None]:
data=pd.read_csv('property data.csv',na_values=["na","--"])
data.isnull().sum()

To impute missing values properly, we need to understand types of variables (e.g., we cannot impute a missing value of a categorical variable with mean).
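A minimal sketch of type-aware imputation follows: numeric columns are filled with the mean and all other columns with the mode. The toy DataFrame and its column names are assumptions for illustration and do not come from `property data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "owner_occupied": ["Y", "N", None, "Y"],
    "num_bedrooms": [3.0, np.nan, 1.0, 2.0],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        fill_value = filled[col].mean()           # numeric: column mean
    else:
        fill_value = filled[col].mode().iloc[0]   # categorical: most frequent value
    filled[col] = filled[col].fillna(fill_value)

print(filled)
```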

In [None]:
# Determine types of variables:
for col in data.columns:
    print(f'{col}: {data[col].apply(type).unique()}')

In [None]:
# Add a feature indicating a missing value in 'SQ_FT':
data['SQ_FT MV']=data['SQ_FT'].isnull()
data

In [None]:
# Impute missing values in 'SQ_FT' with the column mean:
mean=data['SQ_FT'].mean()
data['SQ_FT']=data['SQ_FT'].fillna(mean)
data

In [None]:
# Pandas also provides linear interpolation:
data['PID']=data['PID'].interpolate()
data
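To see what linear interpolation does, here is a small sketch on a made-up Series. With the default settings, `interpolate` treats the values as equally spaced and fills each gap on a straight line between its neighbors.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a two-element gap in the middle.
s = pd.Series([100.0, np.nan, np.nan, 400.0])

filled = s.interpolate()   # default method is linear
print(filled.tolist())
```

Whether this is appropriate for an identifier column depends on the data: it only makes sense if the identifiers are known to increase evenly from row to row.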

In [None]:
iris=load_iris()
data=pd.DataFrame(data=np.c_[iris['data'],iris['target']],columns=iris['feature_names']+['target'])

In [None]:
# Pearson correlation:
data.corr()

In [None]:
# Pearson correlation visualized:
corr=data.corr()
corr.style.background_gradient(cmap='bwr')
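For reference, the Pearson coefficient of two columns is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers and compares it with `DataFrame.corr`; the helper name `pearson` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 5.0, 9.0]})

manual = pearson(toy["a"], toy["b"])
builtin = toy.corr().loc["a", "b"]
print(manual, builtin)
```

The two numbers should agree up to floating-point error, since `corr` uses the Pearson method by default.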

### *Pandas Profiling Library*

```pandas_profiling``` is a third-party library (it is not part of pandas itself, and later releases were renamed ```ydata-profiling```) that automatically generates reports from a pandas DataFrame.
For each column in the DataFrame, ```pandas_profiling``` reports the following (when applicable):

- Overview (type, unique values, missing values, etc.);
- Descriptive statistics (range, quantiles, etc.);
- Histograms and correlation matrices.


In [None]:
import pandas_profiling
data=pd.read_csv("Meteorite_Landings.csv",encoding='UTF-8')
print(data.shape)
data.head(5)

In [None]:
pandas_profiling.ProfileReport(data)

In [None]:
# Save report:
ppf=pandas_profiling.ProfileReport(data)
ppf.to_file("pandas-profiling.html")

### *Mask Analysis*

Mask analysis (string pattern analysis) discovers the structure of values symbol by symbol. Symbols are partitioned into classes, and each class is encoded as follows:
- lower case letter: 'l';
- upper case letter: 'L';
- digit: 'D';
- space: 's';
- missing value: '-null-'.

Special characters (all other symbols, e.g. ?!^#@) are left unencoded. Examples of mask analysis:
- 'Van' returns 'Lll';
- 'VAN' returns 'LLL';
- 'Van BC' returns 'LllsLL';
- '+1 123-1234-5555' returns '+DsDDD-DDDD-DDDD'.

In [None]:
data=pd.read_csv('2017business_licences.csv')
print(data.shape)
print(data.columns)
data.head(5)

In [None]:
# Mask analysis:
def getMask(field):
    """Encode a value symbol by symbol; missing values map to '-null-'."""
    if pd.isnull(field):
        return '-null-'
    mask=''
    for character in str(field):
        if 65<=ord(character)<=90: # ASCII 65 to 90 are upper case letters
            mask+='L'
        elif 97<=ord(character)<=122: # ASCII 97 to 122 are lower case letters
            mask+='l'
        elif 48<=ord(character)<=57: # ASCII 48 to 57 are digits
            mask+='D'
        elif ord(character)==32: # ASCII 32 is the space character
            mask+='s'
        else: # special characters are kept as-is
            mask+=character
    return mask

def mask_profile(series):
    """Count each mask in a column and report its share in percent."""
    counts=series.apply(getMask).value_counts() # compute the masks once
    result=pd.DataFrame({
        'Count':counts,
        '%':(counts/counts.sum()*100).round(2)
    })
    return result

In [None]:
mask_profile(data['LicenceNumber']).head(5)

In [None]:
mask_profile(data['House']).head(5)

In [None]:
mask_profile(data['PostalCode']).head(5)
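One way a mask profile might be used is to flag values that deviate from the dominant pattern in a column. The sketch below is self-contained: `get_mask` is a compact re-implementation of the encoding described above, and the postal codes are hypothetical examples, not rows from `2017business_licences.csv`.

```python
import pandas as pd

def get_mask(value):
    """Same encoding as above: L upper, l lower, D digit, s space."""
    if pd.isnull(value):
        return "-null-"
    out = []
    for ch in str(value):
        if "A" <= ch <= "Z":
            out.append("L")
        elif "a" <= ch <= "z":
            out.append("l")
        elif "0" <= ch <= "9":
            out.append("D")
        elif ch == " ":
            out.append("s")
        else:
            out.append(ch)   # special characters stay as they are
    return "".join(out)

# Hypothetical postal codes, including a badly formatted and a missing one.
codes = pd.Series(["V6B 1A1", "V5K 0A4", "v6b1a1", None, "V6T 1Z4"])

masks = codes.apply(get_mask)
dominant = masks.value_counts().idxmax()   # most frequent pattern
outliers = codes[masks != dominant]        # values that do not follow it

print(dominant)
print(outliers)
```

Values flagged this way are only candidates for cleaning; whether a deviating pattern is an error is a domain question.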