# MODULE 1 #

## Python Libraries ##


 - Scientific Computing
    - Pandas : Data Structures & Tools ~ Data manipulation & Analytics
    - NumPy : Arrays & Matrices ~ all for data processing (can be used on obj
    - SciPy : Integrals, Differentaial Equations, Optimization
    
- Visualization Libraries
    - MatPlotLib : Plots and graphs
    - Seaborn : plots: heatmaps, time series, violin plots)

- Algorithmic Libraries (machine learning)
    - scikit learn : statistical madeling, regression, classification, etc
    - Statisical Models : explore data, estimate statistical models and perfrom tests.


| Data Types                   | Pandas     | Python |
|------------------------------|------------|--------|
| numbers and strings          | object     | string |
| numeric characters           |  int64     |   int  |
| numeric characters w/decimal | float64    |  float |
| time data                    | datetime64 | datetime module|



In [3]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

url = file_path = "automobile.csv"

df = pd.read_csv(url, header=None)

headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers
df.head(10)


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
2,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
3,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
4,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
5,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
6,2,,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.40,8.5,110,5500,19,25,15250
7,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.40,8.5,110,5500,19,25,17710
8,1,,audi,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.40,8.5,110,5500,19,25,18920
9,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.40,8.3,140,5500,17,20,23875


# Module 2 #

## Manipulating Data Frames ## 

- Missing  values
    - replace it with the average/mode if possible
    - drop the value/missing value;     df.dropna()
    

In [None]:
#f['rawheader'] = df['rawheader'] +1

df.dropna(subset = ['curb-weight'], axis=0)  #drop row

#df.dropna(subset = ['curb-weight'], axis=1)  #drop column

df.dropna(subset = ['curb-weight'], axis=0, inplace = True)  #inplace acts directly on data set

#alternative to 'inplace'
df[col].method(value, inplace=True)  #original 

#new
df.method({col: value}, inplace=True)
df[col] = df[col].method(value) 

### Replace with new value ###

In [None]:
df.replace('missing_value', 'new_value')
mean = df.mean({'col_name'})
df.replace(np.nan, mean)

#Replace missing data with frequency
MostFrequentEntry = df['attribute_name'].value_counts().idxmax() 
df['attribute_name'].replace(np.nan,MostFrequentEntry
df['attribute_name'].replace(np.nan,MostFrequentEntry, inplace=True))

#Replace missing data with mean
AverageValue=df['attribute_name'].astype('data_type').mean(axis=0)
df['attribute_name'].replace(np.nan, AverageValue, inplace=True)

### Cleaning Data ###

In [None]:
print(df.dtypes)  ##everything is type object... sad face
# 'normalized-losses' column is missing a ton of data

In [None]:
#drop duplicate rows 
df = df.drop_duplicates

In [None]:
#drop un-needed columns
df = df.drop(columns = 'aspiration')


In [None]:
# remove spaces and NaN values
df['colname'].str.strip('.../,') #will remove / ' ' .
df['colname'] = df['colname'].str.replace('[^a-zA-Z0-9]', '')  ## same affect, diff format
                #cleaning phone numbers, remove everything that is not letter or number
df['colname'] = df['colname'].apply(lambda x:x[0:3] + '-' + x[3:6] + '-' + x[6:10]) ## same output, using lambda

### Convert Data Types ###

In [None]:
#int(df['normalized-losses'])     #gives error message, cant convert obj/series/list to type int

#replace NaN values with 0
df['normalized-losses'] = df['normalized-losses'].replace(np.nan, 0)
df['stroke'] = df['stroke'].replace(np.nan, 0.0)

#convert column to int
df['normalized-losses'] = pd.to_numeric(df['normalized-losses'], errors='coerce').astype('Int64')
#df['symboling'] = pd.to_numeric(df['symboling'], errors='coerce').astype('Int64')
#df['engine-size'] = pd.to_numeric(df['engine-size'], errors='coerce').astype('Int64')
#df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce').astype('Int64')
#df['peak-rpm'] = pd.to_numeric(df['peak-rpm'], errors='coerce').astype('Int64')
#df['city-mpg'] = pd.to_numeric(df['city-mpg'], errors='coerce').astype('Int64')
#df['highway-mpg'] = pd.to_numeric(df['highway-mpg'], errors='coerce').astype('Int64')
#df['price'] = pd.to_numeric(df['priceg'], errors='coerce').astype('Int64')


#convert column to float
#df['wheel-base'] = pd.to_numeric(df['wheel-base'], errors='coerce').astype('Int64').astype(float)
for x in df:
        print(x)
        if x == "unamed: 0":
              
x = 1
for x in df['wheel-base']:
    pd.to_numeric(df['wheel-base'].astype(np.float64))
#df['wheel-base'] = df['wheel-base'].astype(float)
#df['bore'] = pd.to_numeric(df['bore'], errors='coerce').astype('Int64')
#df['stroke'] = pd.to_numeric(df['bore'], errors='coerce').astype('Int64')
#df['compression-ratio'] = pd.to_numeric(df['compression-ratio'], errors='coerce').astype('Int64')
#df.head(10)





ValueError: could not convert string to float: 'wheel-base'

In [None]:
print(df.dtypes)
#print(df.head(10))

In [None]:

# replace missing values
# check your work

## Data Formatting & Normalization ##

- Simple feature scaling
    - new value = old value divided by max value
    - df['col_name'] = df['col_name'] = df['col_name'] / df['col_name'].max()

- min-max
    - new value = old value - min value divided by max value minus min value
    - df['col_name'] = (df['col_name'] - df['len'].mean()) / df['col_name'].std()


- zscore
    - new value = old value minus average divided by standard deviation

- Data Normalization
    - df['attribute_name'] = df['attribute_name']/df['attribute_name'].max()  


### Binning ###

- create n number of grups of low, med, high, etc

In [None]:
bins = np.linspace(min(df['col_name'], max(df['col_name']), 4))
df['col_binned'] = pd.cut(df['col'], bins, labels = 'group_names', include_lowest = True)

#Binning
bins = np.linspace(min(df['attribute_name']), max(df['attribute_name'],n))  # n is the number of bins needed 
GroupNames = ['Group1','Group2','Group3',...]
df['binned_attribute_name'] = pd.cut(df['attribute_name'], bins, labels=GroupNames, include_lowest=True)

In [None]:
#Replace missing data with frequency
MostFrequentEntry = df['attribute_name'].value_counts().idxmax() 
df['attribute_name'].replace(np.nan,MostFrequentEntry, df['attribute_name'].replace(np.nan,MostFrequentEntry, inplace=True))

#Replace missing data with mean
AverageValue=df['attribute_name'].astype('data_type').mean(axis=0)
df['attribute_name'].replace(np.nan, AverageValue, inplace=True)

#Fix the data types
df[['attribute1_name', 'attribute2_name', ...]] = df[['attribute1_name', 'attribute2_name', ...]].astype('data_type')

#data_type is int, float, char, etc. 

#Change column name
df.rename(columns={'old_name': 'new_name'}, inplace=True)

#Indicator Variables
dummy_variable = pd.get_dummies(df['attribute_name'])
df = pd.concat([df, dummy_variable],axis = 1)


## Data Analysis ##

In [None]:
value_counts()

col_counts = df = df['col'].value_counts()
col_counts.rename(columns = {'col' : 'value_counts'}, inplace = True)
col_counts.indexnon = 'col_name'

#Group By
df.groupby()
df_test = df[['col1', 'col2', 'col3']]
df_grp = df_test.groupby(['col1'], ['col2'], as_index = False).mean()

# MODULE 3 #

## Data Visualization ## 

In [None]:
#Load Libraries
from matplotlib import pyplot as plt
import matplotlib.pyplot as plt 
%matplotlib inline

import seaborn as sns

### Standard Line Plot ###

In [None]:
## x is independent variable and y is dependent  variable
plt.plot(x,y)

In [None]:
plt.scatter(x,y)

In [None]:
plt.hist(x,bins)

In [None]:
plt.bar(x,height)

In [None]:
plt.pcolor(C)

### Seaborne Functions ###

- Regression Plot
    - plot draws a scatter plot of two variables, x and y, and then fits the regression model and plots the 
    - resulting regression line along with a 95% confidence interval for that regression. 

In [None]:
sns.regplot(x = 'header_1',y = 'header_2',data= df)

### Box and whisker plot ### 
 - shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers".

### Residual Plot ### 
- Used to display the quality of polynomial regression. This function will regress y on x as a polynomial regression and then draw a scatterplot of the residuals.
- Residuals are the differences between the observed values of the dependent variable and the predicted values obtained from the regression model. In other words, a residual is a measure of how much a regression line vertically misses a data point, meaning how far off the predictions are from the actual data points.

In [None]:
sns.residplot(data=df,x='header_1', y='header_2')
#sns.residplot(x=df['header_1'], y=df['header_2'])

### KDE plot ### 
- A Kernel Density Estimate (KDE) plot is a graph that creates a probability distribution curve for the data based upon its likelihood of occurrence on a specific value. This is created for a single vector of information. It is used in the course in order to compare the likely curves of the actual data with that of the predicted data.

In [None]:
sns.kdeplot(X)

### Distribution Plot ### 
- s plot has the capacity to combine the histogram and the KDE plots. 
- This plot creates the distribution curve using the bins of the histogram as a reference for estimation.
- You can optionally keep or discard the histogram from being displayed. 
      - can be used interchangeably with the KDE plot

sns.distplot(X,hist=False)

helpful links

https://www.geeksforgeeks.org/python-pandas-dataframe-to_string/

https://stackoverflow.com/questions/39173813/pandas-convert-dtype-object-to-int

https://www.markdownguide.org/cheat-sheet/