# Preliminary Data Analytic Pipeline 
Provide a coded solution for each area below. Where appropriate, show output and explanations/insights. Make sure the notebook runs properly.
If any of the libraries below are not installed, install them with:

!pip install <lib>
* [pandas](https://pandas.pydata.org/)  
   * [Pandas Tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) 
   * [Pandas Example](https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4)
* [numpy](https://numpy.org/) 
   * [Numpy Examples](https://numpy.org/doc/stable/user/quickstart.html)
* [scikit-learn with Examples](https://scikit-learn.org/stable) 
* [ydata_profiling](https://ydata-profiling.ydata.ai/docs/master/pages/getting_started/overview.html)

In [2]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install ydata-profiling


In [3]:
import pandas as pd
import numpy as np
import sklearn
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

## Data Integration
Use the fraud dataset introduced earlier in the course.

In [4]:
# data integration
data = pd.read_csv('transactions.csv')

# displaying the first five rows of the dataset
data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,sex,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


## Format and Type
Determine the format of the file and the types of each feature.

In [5]:
## format and type
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852394 entries, 0 to 1852393
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Unnamed: 0             int64  
 1   trans_date_trans_time  object 
 2   cc_num                 int64  
 3   merchant               object 
 4   category               object 
 5   amt                    float64
 6   first                  object 
 7   last                   object 
 8   sex                    object 
 9   street                 object 
 10  city                   object 
 11  state                  object 
 12  zip                    int64  
 13  lat                    float64
 14  long                   float64
 15  city_pop               int64  
 16  job                    object 
 17  dob                    object 
 18  trans_num              object 
 19  unix_time              int64  
 20  merch_lat              float64
 21  merch_long             float64
 22  is_fraud          
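
The `object` dtype on `trans_date_trans_time` and `dob` means pandas read those columns as plain strings. Below is a minimal sketch of converting them, using a three-row frame built from the rows shown by `data.head()`. The name `sample` is illustrative, and storing `cc_num` as a string identifier is an assumption about how the column will be used, not something the assignment requires.

```python
import pandas as pd

# Three rows copied from the head() output above; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "trans_date_trans_time": [
        "2019-01-01 00:00:18",
        "2019-01-01 00:00:44",
        "2019-01-01 00:00:51",
    ],
    "dob": ["1988-03-09", "1978-06-21", "1962-01-19"],
    "cc_num": [2703186189652095, 630423337322, 38859492057661],
})

# Parse the timestamp strings into datetime64 columns.
sample["trans_date_trans_time"] = pd.to_datetime(sample["trans_date_trans_time"])
sample["dob"] = pd.to_datetime(sample["dob"])

# Assumption: card numbers are identifiers, not quantities, so keep them as strings.
sample["cc_num"] = sample["cc_num"].astype(str)

print(sample.dtypes)
```

On the full dataset the same conversions could be applied to `data` directly, or the date columns could be parsed at load time with the `parse_dates` argument of `pd.read_csv`.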

In [None]:
profile = ProfileReport(
    data, title="Transactions Dataset", html={"style": {"full_width": True}}, sort=None)

profile


## Analysis
Determine the characteristics of each feature: summary statistics for int/float features, and whether each text feature is categorical.

In [14]:
# numerical features analysis
numerical_features = data.select_dtypes(include=['int64', 'float64'])


numerical_stats = numerical_features.describe()
numerical_stats

Unnamed: 0.1,Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud
count,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0
mean,926196.5,4.17386e+17,70.06357,48813.26,38.53931,-90.22783,88643.67,1358674000.0,38.53898,-90.22794,0.005210015
std,534740.2,1.309115e+18,159.254,26881.85,5.07147,13.74789,301487.6,18195080.0,5.105604,13.75969,0.07199217
min,0.0,60416210000.0,1.0,1257.0,20.0271,-165.6723,23.0,1325376000.0,19.02742,-166.6716,0.0
25%,463098.2,180042900000000.0,9.64,26237.0,34.6689,-96.798,741.0,1343017000.0,34.74012,-96.89944,0.0
50%,926196.5,3521417000000000.0,47.45,48174.0,39.3543,-87.4769,2443.0,1357089000.0,39.3689,-87.44069,0.0
75%,1389295.0,4642255000000000.0,83.1,72042.0,41.9404,-80.158,20328.0,1374581000.0,41.95626,-80.24511,0.0
max,1852393.0,4.992346e+18,28948.9,99921.0,66.6933,-67.9503,2906700.0,1388534000.0,67.51027,-66.9509,1.0


In [18]:
# categorical features analysis
categorical_features = data.select_dtypes(include=['object'])

categorical_stats = categorical_features.describe()
categorical_stats

Unnamed: 0,trans_date_trans_time,merchant,category,first,last,sex,street,city,state,job,dob,trans_num
count,1852394,1852394,1852394,1852394,1852394,1852394,1852394,1852394,1852394,1852394,1852394,1852394
unique,1819551,693,14,355,486,2,999,906,51,497,984,1852394
top,2019-04-22 16:02:01,fraud_Kilback LLC,gas_transport,Christopher,Smith,F,444 Robert Mews,Birmingham,TX,Film/video editor,1977-03-23,0b242abb623afc578575680df30655b9
freq,4,6262,188029,38112,40940,1014749,4392,8040,135269,13898,8044,1


## Clean up
* Find and list the number of blank entries and outliers/errors
* Take corrective actions and provide justification
* Remove unnecessary features
* If taking a categorical approach, break out the input features (X) from the output features (y)

In [None]:
# Clean up
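
One way the steps above might look, as a minimal sketch. The small `toy` frame is invented data standing in for `data` (it only borrows a few column names from the outputs above). The 1.5 × IQR outlier rule, the decision to keep outliers, and the choice of dropped columns are all assumptions that would need their own justification on the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values are invented for illustration only.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4, 5],
    "amt": [4.97, 107.23, np.nan, 45.0, 41.96, 28948.9],
    "category": ["misc_net", "grocery_pos", "entertainment", None, "misc_pos", "misc_net"],
    "trans_num": ["a", "b", "c", "d", "e", "f"],
    "is_fraud": [0, 0, 0, 0, 0, 1],
})

# 1. Count blank (missing) entries per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# 2. Flag numeric outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = toy["amt"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["amt"] < q1 - 1.5 * iqr) | (toy["amt"] > q3 + 1.5 * iqr)
print("amt outliers:", int(outlier_mask.sum()))

# 3. Corrective action (assumed): drop rows with missing values, but keep
#    outliers, since extreme amounts may be exactly the fraud signal.
cleaned = toy.dropna()

# 4. Remove identifier-like columns that carry no predictive information.
cleaned = cleaned.drop(columns=["Unnamed: 0", "trans_num"])
print(cleaned)
```

Whether to drop, cap, or keep flagged rows depends on the model and on what the outliers represent, so the justification asked for above matters more than the specific rule.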

## Normalize
Don't worry about text features, but you must normalize the numeric features.
* Provide a rationale for why the particular normalization method was selected.

In [None]:
# Normalize the numeric features
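
A minimal sketch of one possible choice, z-score standardization with scikit-learn's `StandardScaler`. The rationale assumed here is that `amt` and `city_pop` are on very different scales and have long right tails in the `describe()` output above, so min-max scaling would compress most values near zero. The `toy` frame reuses five rows from `data.head()`, and `numeric_cols` is an assumed selection.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for `data`; values copied from the head() output above.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.0, 41.96],
    "city_pop": [3495, 149, 4154, 1939, 99],
    "category": ["misc_net", "grocery_pos", "entertainment", "gas_transport", "misc_pos"],
})

# Assumed selection: the label and identifier columns would be excluded.
numeric_cols = ["amt", "city_pop"]

# Standardize each numeric column to zero mean and unit variance.
scaler = StandardScaler()
scaled_numeric = pd.DataFrame(
    scaler.fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# Text columns pass through unchanged.
normalized = pd.concat([scaled_numeric, toy.drop(columns=numeric_cols)], axis=1)
print(normalized.round(3))
```

In a full pipeline the scaler would normally be fitted on the training split only and then applied to the validation and test splits, so that statistics from held-out data do not leak into training.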

## Feature and Label Selection
Down-select the input features and label(s) from your data.

In [None]:
# Set up the input features (X) and the associated label(s) (y)
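
A minimal sketch of the selection, assuming `is_fraud` is the label and that a handful of columns are kept as inputs. The `feature_cols` list is an illustrative choice, not a recommendation, and the `toy` frame uses made-up `trans_num` values.

```python
import pandas as pd

# Toy stand-in for `data`; trans_num values are placeholders.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.0, 41.96],
    "category": ["misc_net", "grocery_pos", "entertainment", "gas_transport", "misc_pos"],
    "city_pop": [3495, 149, 4154, 1939, 99],
    "trans_num": ["t0", "t1", "t2", "t3", "t4"],
    "is_fraud": [0, 0, 0, 0, 0],
})

label_col = "is_fraud"                          # output label (y)
feature_cols = ["amt", "category", "city_pop"]  # assumed input features (X)

X = toy[feature_cols]
y = toy[label_col]

print(X.shape, y.shape)
```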

## Split into 3 data sets for training, validation, and test (Explain your % for each)

In [None]:
# Split
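
A minimal sketch of a 70/15/15 split made with two calls to `train_test_split`. The percentages are an assumed starting point that still needs the explanation requested above. Stratifying on the label is also an assumption, motivated by the `is_fraud` mean of about 0.005 in the `describe()` output, which indicates a rare positive class. The synthetic data below uses a much higher fraud rate than the real dataset so that every split contains both classes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20 labeled as fraud (not the real rate).
rng = np.random.default_rng(0)
X = pd.DataFrame({"amt": rng.uniform(1, 500, size=100)})
y = pd.Series([1] * 20 + [0] * 80, name="is_fraud")

# First split: 70% training, 30% held out.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Second split: divide the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Because the transactions are time-stamped, a chronological split (train on earlier rows, test on later ones) is another option worth considering; the random split here is only the simplest case.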

## Summary
Provide your thoughts on the quality, amount, trustworthiness, deficiencies, timeliness, and available documentation of the data you selected. This can be written and/or coded to demonstrate your conclusions.
* Determine if the data selected is suitable for machine learning ingestion.
* Note that other preprocessing steps may be needed depending on the data (such as graphics, free-form text, and graphs) and/or the type of model (such as a time series model). These topics are covered in the upcoming modules.

In [None]:
# Summary
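
Some of these conclusions can be backed by short checks. The sketch below measures class balance and flags thinly populated categories on an invented `toy` frame. The 15% threshold for "sparse" is an arbitrary assumption chosen only to make the example visible.

```python
import pandas as pd

# Invented data: 10 transactions, one of them fraudulent.
toy = pd.DataFrame({
    "category": ["gas_transport"] * 6 + ["grocery_pos"] * 3 + ["misc_net"],
    "is_fraud": [0] * 9 + [1],
})

# Share of each label value; a very small positive share signals class imbalance.
class_balance = toy["is_fraud"].value_counts(normalize=True)
print(class_balance)

# Share of rows per category; categories under the assumed threshold are flagged.
category_share = toy["category"].value_counts(normalize=True)
sparse_categories = category_share[category_share < 0.15]
print(sparse_categories)
```

On the real data, the `is_fraud` mean reported by `describe()` above already gives the positive share directly, since the column only takes the values 0 and 1.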

## Quality Check
After your analysis, provide details on the following qualities of your selected data.
* Overall quality of the data
* Sufficient amount of the data
* Sparseness of any data categories (e.g., no young adults)
* Trustworthiness of the data (Is it true?)
* Timeliness of the data (Is it recent?) What might be the problem if it is not?
* Note deficiencies
* Available documentation on the data types, how the data was collected, and how it was verified
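
For timeliness and trustworthiness, one possible check is whether the two time columns agree with each other. The sketch below uses three rows copied from the `data.head()` output above and compares `trans_date_trans_time` with `unix_time` interpreted as seconds since the Unix epoch. The name `sample` and the interpretation of `unix_time` in seconds are assumptions.

```python
import pandas as pd

# Three rows copied from the head() output above; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "trans_date_trans_time": [
        "2019-01-01 00:00:18",
        "2019-01-01 00:00:44",
        "2019-01-01 00:00:51",
    ],
    "unix_time": [1325376018, 1325376044, 1325376051],
})

# Timestamp as stated in the text column.
stated = pd.to_datetime(sample["trans_date_trans_time"])

# Timestamp implied by unix_time, assuming it counts seconds since 1970-01-01.
from_unix = pd.to_datetime(sample["unix_time"], unit="s")

# Difference between the two; zero everywhere would mean they agree.
offset = stated - from_unix
print(offset)
```

A constant, non-zero offset would suggest the two columns use different reference points, which is the kind of finding worth recording under documentation and trustworthiness.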

Provide your answers here for the quality check...