# Exploratory Data Analysis - Template - Evaluating Data for DB Import

### GOAL:
- State your goal and modify your EDA tasks accordingly
- Add extra tasks whenever the data exploration yields insights that call for them
- Identify interesting attributes for creating a Snowflake database schema

Example approaches / techniques:
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings

Questions:
- What data types do you expect?
- What are relevant validation tasks? For example, the sum of country emissions cannot exceed world emissions or, more blatantly, a single country cannot have more emissions than the whole world.
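
The emissions validation mentioned above could be sketched roughly as follows. The dataframe, its column names (`country`, `year`, `emissions`) and the `"World"` label are hypothetical toy data for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data: emissions per country and year, plus a "World" total row.
emissions = pd.DataFrame({
    "country": ["A", "B", "World", "A", "B", "World"],
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "emissions": [10.0, 20.0, 35.0, 15.0, 30.0, 40.0],
})

is_world = emissions["country"] == "World"
# Reported world total per year.
world_total = emissions[is_world].set_index("year")["emissions"]
# Sum over all individual countries per year.
country_sum = emissions[~is_world].groupby("year")["emissions"].sum()

# Years in which the countries together exceed the reported world total.
violations = country_sum[country_sum > world_total.reindex(country_sum.index)]
print(violations)
```

A non-empty `violations` series points at years that need a closer look before the data goes into the database.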

## Prerequisites:
- Download and extract the data into the data folder
- Install and add the packages required to load the data
- Open a documentation file for the results
- Make sure to add a paragraph to your final README that reflects the most important findings

## Imports

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import pandas_profiling  # note: newer releases of this package are published as ydata-profiling

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import clear_output

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import time
import warnings
warnings.filterwarnings('ignore')

# change display of floats in dataframes
pd.set_option('display.float_format', lambda x: "{0:,.0f}".format(x))

#### SOURCE: <url>

## Load data

In [None]:
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

## Explore / Question
- Structure of the dataset (multiple files?), schema? (EVA, multiple tables)
- NaNs, nulls - quantify / handle
- Duplicates - quantify, remove
- Data types - inspect and adjust if necessary (are only a few values responsible for the more general data type?)
- describe for numeric values
- Skew / distribution of data / value range - draw box-and-whisker charts
- Extreme values / outliers / anomalies - any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier
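
The 1.5 * IQR rule from the list above can be sketched as a small helper. The function name `iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Illustrative values: one obviously extreme entry.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
outliers = iqr_outliers(values)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still needs a manual check. The rule only narrows down where to look.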

Drop
- Irrelevant or uninteresting data, column-wise. For example, if we were analyzing data about the general health of the population, the phone number wouldn't be necessary.
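
A minimal sketch of dropping such columns. The `patients` dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical health data with a column that is irrelevant for the analysis.
patients = pd.DataFrame({
    "age": [34, 58],
    "blood_pressure": [120, 135],
    "phone_number": ["555-0100", "555-0101"],
})

irrelevant = ["phone_number"]
# errors="ignore" skips names that are not present instead of raising.
patients = patients.drop(columns=irrelevant, errors="ignore")
print(patients.columns.tolist())
```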


In [None]:
# helper that prints a report from several common pandas EDA commands
def eda(dataframe):
    print("missing values:\n{}".format(dataframe.isnull().sum()))
    print("dataframe index: {}".format(dataframe.index))
    print("dataframe types:\n{}".format(dataframe.dtypes))
    print("dataframe shape: {}".format(dataframe.shape))
    print("dataframe describe:\n{}".format(dataframe.describe()))
    # number of unique values per column
    for item in dataframe:
        print(item)
        print(dataframe[item].nunique())
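
To check whether only a few values are responsible for a column's more general data type, one possible approach is to coerce the column to numeric and list the entries that fail. The `ages` series below is invented sample data.

```python
import pandas as pd

# Invented sample: a numeric column that was read as strings because of two entries.
ages = pd.Series(["22", "38", "n/a", "35", "unknown"], name="age")

# Values that cannot be parsed become NaN.
as_number = pd.to_numeric(ages, errors="coerce")
# Entries that were present in the raw data but are not numeric.
offenders = ages[as_number.isna() & ages.notna()]
print(offenders)
```

If only a handful of offenders show up, cleaning them and converting with `.astype()` is usually preferable to keeping the whole column as strings.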

In [None]:
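# creates pandas profiling report and renders it inside the notebook
profile = pandas_profiling.ProfileReport(df)
profile.to_notebook_iframe()
# show five random rows (only the last expression of a cell is displayed automatically)
df.sample(5)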

For example, the code below generates a bar chart showing how many missing values are in each column of the dataframe.

In [None]:
# plot the number of missing values per column:
df.isnull().sum().plot(kind='bar')
# Add a title.
plt.title('Number of Missing Values Per Column')
# Rotate the tick labels on the X axis.
plt.xticks(rotation=45)
# Create X axis label.
plt.xlabel("Columns")
# Create Y axis label.
plt.ylabel("NaN Values");

## Clean / Verify

Missing values
- .dropna(): drop NaN values
- .fillna(): impute NaN values
- or: impute

Duplicates
- .drop_duplicates(): drop duplicate rows

Formatting
- Remove white spaces: "   hello world  " => "hello world"
- Pad strings: 313 => 000313 (6 digits)
- Maybe fix typos: strings can be entered in many different ways and often contain mistakes (Gender: m, Male, fem., FemalE, Femle). Find the unique values of these columns and replace typos with the correct values, possibly using fuzzy matching.

Data types
- .astype(): change a column data type
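
The cleaning steps listed above might look like this on a toy dataframe. The column names `name` and `code` and the fill value `"unknown"` are assumptions for illustration.

```python
import pandas as pd

# Toy data with surrounding whitespace, a duplicate row and a missing value.
raw = pd.DataFrame({
    "name": ["   hello world  ", "foo", "foo", None],
    "code": [313, 42, 42, 7],
})

cleaned = raw.copy()
# Remove leading and trailing whitespace.
cleaned["name"] = cleaned["name"].str.strip()
# Change the data type to string and pad to 6 digits.
cleaned["code"] = cleaned["code"].astype(str).str.zfill(6)
# Drop fully identical rows.
cleaned = cleaned.drop_duplicates()
# Impute missing names with a placeholder.
cleaned["name"] = cleaned["name"].fillna("unknown")
print(cleaned)
```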

In [None]:
import re
# The first solution is to manually map each value to either "male" or "female".
# Values missing from the mapping become NaN.
dataframe['gender'].map({'m': 'male', 'fem.': 'female'})  # ... extend with the remaining variants
# The second solution is to use pattern matching. For example, we can normalize any value that starts with m or M.
re.sub(r"^m\w*$", 'male', 'Male', flags=re.IGNORECASE)
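
As a third option, fuzzy matching can be sketched with the standard library's `difflib`. The helper `normalize_gender`, the list `CANONICAL` and the cutoff of 0.6 are illustrative choices, not a recommendation from the sources.

```python
from difflib import get_close_matches

# Canonical spellings that every raw value should be mapped to.
CANONICAL = ["male", "female"]

def normalize_gender(value, cutoff=0.6):
    """Return the closest canonical spelling, or None if nothing is similar enough."""
    cleaned = value.strip().lower()
    matches = get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw_value in ["Male", " FemalE ", "Femle", "m"]:
    print(repr(raw_value), "->", normalize_gender(raw_value))
```

Very short abbreviations such as `m` are not similar enough to any canonical spelling and come back as `None`, so they still need an explicit mapping like the one in the first solution.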

## Transform and Aggregate
- E.g. if your aim is to arrive at data per year and country, transform and aggregate your input data accordingly
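
A minimal sketch of such an aggregation to year and country level, assuming hypothetical columns `date`, `country` and `value`:

```python
import pandas as pd

# Hypothetical input with one row per measurement.
records = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-06-30", "2021-03-01", "2020-02-10"]),
    "country": ["DE", "DE", "DE", "FR"],
    "value": [1.0, 2.0, 4.0, 8.0],
})

# Derive the year, then sum the values per year and country.
yearly = (
    records.assign(year=records["date"].dt.year)
    .groupby(["year", "country"], as_index=False)["value"]
    .sum()
)
print(yearly)
```

Whether `sum`, `mean` or another aggregate is appropriate depends on what the value column measures.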

- Standardize if necessary: lower or upper case, number format, same units where applicable
- Scale: bring values to a common range, e.g. 0-100 or GPA 0-5
- Normalize: map to [0, 1] if applicable
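
Standardizing, normalizing and scaling could be sketched as follows. The `scores` dataframe and its columns are made-up examples, and the min-max formula assumes the column is not constant (otherwise it divides by zero).

```python
import pandas as pd

# Made-up data with inconsistent text formatting.
scores = pd.DataFrame({
    "country": [" Germany", "FRANCE ", "spain"],
    "score": [20.0, 50.0, 80.0],
})

# Standardize text: trim whitespace and use lower case throughout.
scores["country"] = scores["country"].str.strip().str.lower()

# Normalize to [0, 1] with min-max scaling.
lowest, highest = scores["score"].min(), scores["score"].max()
scores["score_norm"] = (scores["score"] - lowest) / (highest - lowest)

# Scale to a 0-100 range.
scores["score_0_100"] = scores["score_norm"] * 100
print(scores)
```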

## Document / Save

In [None]:
# export dataframe to .csv
df.to_csv('export_2018_pricelist.csv', index=False)

## Sources
- https://medium.com/swlh/eda-exploratory-data-analysis-e0f453d97894
- https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625
- https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4