__Title:__ Lab 1: Visualization and Data Preprocessing  
__Authors:__ Butler, Derner, Holmes  
__Date:__ 1/22/23 

## Rubric

|Category | Points | Description |
| --- | --- | --- |
| Business Understanding | 10 | Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.
| Data Meaning Type | 10 | Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
| Data Quality | 15 | Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods.
| Simple Statistics | 10 | Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful. 
| Visualize Attributes | 15 | Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.
| Explore Joint Attributes | 15 | Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
| Explore Attributes and Class | 10 | Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).
| New Features | 5 | Are there other features that could be added to the data or created from existing features? Which ones?
| Exceptional Work | 10 | You have free rein to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.
| Total | 100 | |


In [3]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#import data
file_load = "data/Combined_Flights_2022.csv"

# read_csv already returns a DataFrame, so no extra conversion is needed
flight_data_df = pd.read_csv(file_load, encoding="utf-8")
flight_data_df.head()

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,...,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133,1123.0,0.0,-10.0,...,1140.0,1220.0,8.0,1245,-17.0,0.0,-2.0,1200-1259,1,0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732,728.0,0.0,-4.0,...,744.0,839.0,9.0,849,-1.0,0.0,-1.0,0800-0859,2,0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,...,1535.0,1622.0,14.0,1639,-3.0,0.0,-1.0,1600-1659,2,0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,...,1446.0,1543.0,4.0,1605,-18.0,0.0,-2.0,1600-1659,2,0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,...,1154.0,1243.0,8.0,1245,6.0,0.0,0.0,1200-1259,2,0


In [None]:
print(flight_data_df.columns.tolist())

#specify that all columns should be shown
pd.set_option('display.max_columns', None)

#view DataFrame
flight_data_df.head()

# Data Meaning Type

Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
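
One way to start this section is a per-attribute summary table. The sketch below is only an illustration: `sample` is a hypothetical miniature frame that borrows a few column names from the flight data, and `summarize_attributes` is a helper name introduced here, not part of pandas.

```python
import pandas as pd

# Hypothetical miniature frame using a few column names from the flight data
sample = pd.DataFrame({
    "Airline": ["A", "B", "A", "C"],
    "Cancelled": [False, False, True, False],
    "DepDelay": [-10.0, 5.0, None, 32.0],
    "DistanceGroup": [1, 2, 2, 4],
})


def summarize_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct values, missing count, example value."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),       # NaN is not counted as a distinct value
        "n_missing": df.isna().sum(),
        "example": pd.Series({c: df[c].dropna().iloc[0] for c in df.columns}),
    })


attribute_summary = summarize_attributes(sample)
print(attribute_summary)
```

Running the same helper on `flight_data_df` should give a starting point for describing each attribute's scale (nominal, ordinal, interval, ratio), which still has to be judged by hand.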


In [None]:
flight_data_df.info()

# Data Quality
Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods.
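
The cells below cover missing values and duplicates but not outliers. A minimal sketch of one common rule, the 1.5 × IQR fence, is shown here on hypothetical delay values; `iqr_outlier_mask` is a helper name introduced for illustration, and the threshold `k = 1.5` is a convention rather than a requirement.

```python
import pandas as pd

# Hypothetical delay values in minutes; the last one is deliberately extreme
delays = pd.Series([-10.0, -4.0, -15.0, -5.0, 0.0, 3.0, 8.0, 600.0], name="DepDelay")


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


outlier_mask = iqr_outlier_mask(delays)
print(delays[outlier_mask])
```

A flagged value is not necessarily a mistake: a very long delay can be a genuine event, so the flag is a prompt to investigate rather than a reason to drop the row.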


In [None]:
#Find null values
flight_data_df.isnull().sum()

In [None]:
#drop the records that have missing data
flight_data_df2 = flight_data_df.dropna()
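
Before relying on `dropna()`, it may be worth checking which rows it removes. The sketch below uses hypothetical rows; whether missing times in the real data coincide with cancelled flights is an assumption to verify, not something shown in this notebook.

```python
import pandas as pd

# Hypothetical rows; the link between missing values and cancellations is an assumption
sample = pd.DataFrame({
    "Cancelled": [False, False, True, True, False],
    "DepTime": [1123.0, 728.0, None, None, 1514.0],
    "ArrDelay": [-17.0, -1.0, None, None, None],
})

# True for every row that dropna() would remove
rows_with_missing = sample.isna().any(axis=1).rename("has_missing")

# Cross-tabulate the rows to be dropped against the cancellation flag
print(pd.crosstab(sample["Cancelled"], rows_with_missing))

cancelled_and_missing = int((sample["Cancelled"] & rows_with_missing).sum())
operated_and_missing = int((~sample["Cancelled"] & rows_with_missing).sum())
share_dropped = rows_with_missing.mean()
```

If most dropped rows turn out to be cancelled flights, then dropping them removes a whole class of records, which matters if cancellation is ever considered as a target.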

In [None]:
#validate that there is no duplicated data in the dataset
flight_data_df2.duplicated().sum()

In [None]:
# data still has almost 4 million records, feel free to proceed to statistics
flight_data_df2.info()

In [None]:
# let's start by first changing the numeric values to be floats
# continuous_features = ['','','','']


#change all numeric fields to one type for univariate analysis

In [None]:
# and the ordinal values to be integers
# ordinal_features = ['','','','']

In [None]:
# we won't touch these variables, keep them as categorical
#categ_features = ['','','','','']

In [None]:
#use the "astype" method to change the variable type
#flight_data_df2[continuous_features] = flight_data_df2[continuous_features].astype(np.float64)
#flight_data_df2[ordinal_features] = flight_data_df2[ordinal_features].astype(np.int64)

#now the columns should have the intended dtypes
#flight_data_df2.head()
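
As a rough illustration of how the commented template above could be filled in, the sketch below converts a hypothetical miniature frame. Which columns belong in each list is an assumption made here for the example; the notebook has not decided this yet.

```python
import numpy as np
import pandas as pd

# Hypothetical groupings; the real lists are still to be decided
continuous_features = ["DepDelay", "ArrDelay"]
ordinal_features = ["DistanceGroup", "ArrivalDelayGroups"]

sample = pd.DataFrame({
    "DepDelay": [-10, -4, -15],
    "ArrDelay": [-17.0, -1.0, -3.0],
    "DistanceGroup": [1, 2, 2],
    "ArrivalDelayGroups": [-2.0, -1.0, -1.0],
})

# Build one column -> dtype mapping and convert in a single astype call
dtype_map = {col: np.float64 for col in continuous_features}
dtype_map.update({col: np.int64 for col in ordinal_features})
converted = sample.astype(dtype_map)

print(converted.dtypes)
```

Note that casting to `np.int64` fails on columns that still contain NaN, so this step belongs after the missing values have been handled.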

In [None]:
#Separate Numerical and Categorical variables for easy analysis

cat_cols = flight_data_df2.select_dtypes(include=['object']).columns
num_cols = flight_data_df2.select_dtypes(include=np.number).columns.tolist()

print(cat_cols)
print(num_cols)

# Code reference from https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/

# Simple Statistics
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful. 
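
`describe()` does not report range, mode, or variance. A minimal sketch of collecting them in one table is shown below on hypothetical values; the two columns chosen are just examples of a "subset of attributes".

```python
import pandas as pd

# Hypothetical values for two numeric attributes
sample = pd.DataFrame({
    "DepDelay": [-10.0, -4.0, -15.0, -5.0, 0.0],
    "TaxiIn": [8.0, 9.0, 14.0, 4.0, 8.0],
})

# One row per attribute, one column per statistic
stats = sample.agg(["count", "min", "max", "mean", "median", "var"]).T
stats["range"] = stats["max"] - stats["min"]
# mode() can return several values per column; iloc[0] keeps the smallest one
stats["mode"] = sample.mode().iloc[0]

print(stats)
```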


In [None]:
flight_data_df2.describe()

# Visualize Attributes
Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.


In [None]:
#load graphing package
import seaborn as sns

In [None]:
#Univariate Analysis

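for col in num_cols:
    print(col)
    print('Skew :', round(flight_data_df2[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    # left panel: histogram of the attribute
    plt.subplot(1, 2, 1)
    flight_data_df2[col].hist(grid=False)
    plt.ylabel('count')
    # right panel: boxplot of the same attribute
    plt.subplot(1, 2, 2)
    sns.boxplot(x=flight_data_df2[col])
    plt.show()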
    
# Code reference from https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/

# Explore Joint Attributes
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
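
Besides correlation, the prompt mentions group-wise averages and cross-tabulation. The sketch below shows both on hypothetical rows; the column names come from the flight data, but the values and the choice of grouping columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows borrowing column names from the flight data
sample = pd.DataFrame({
    "Airline": ["A", "A", "B", "B", "B"],
    "DistanceGroup": [1, 2, 1, 2, 2],
    "ArrDelay": [-17.0, 5.0, 20.0, 40.0, 0.0],
    "ArrDel15": [0.0, 0.0, 1.0, 1.0, 0.0],
})

# Group-wise average: mean arrival delay per airline
mean_delay_by_airline = sample.groupby("Airline")["ArrDelay"].mean()

# Cross-tabulation: share of flights with ArrDel15 == 1 per airline and distance group
delay_rate_table = pd.crosstab(
    sample["Airline"],
    sample["DistanceGroup"],
    values=sample["ArrDel15"],
    aggfunc="mean",
)

print(mean_delay_by_airline)
print(delay_rate_table)
```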

In [None]:
# First step to explore any relationships between data would be to do a correlation
# numeric_only=True skips the text columns, which corr() cannot handle
flight_data_df2.corr(numeric_only=True)

In [None]:
#Correlation plot
sns.heatmap(flight_data_df2.corr(numeric_only=True))

In [None]:
# Very condensed heat correlation map
corrmat = flight_data_df2.corr(numeric_only=True)
hm = sns.heatmap(corrmat, 
                 cbar=True, 
                 annot=True, 
                 square=True, 
                 fmt='.2f', 
                 #annot_kws={'size': 50}, 
                 # label the axes with the columns that are actually in the matrix
                 yticklabels=corrmat.columns, 
                 xticklabels=corrmat.columns, 
                 cmap="Spectral_r")
plt.show()

In [None]:
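# Example code for boxplot vs target variable (whatever we decide)
# NOTE: "target" and the empty y value are placeholders; fill them in once the target is chosen
sns.catplot(x="target", y="", data=flight_data_df2, kind="box", aspect=1.5)
plt.title("Boxplot for target vs ")
plt.show()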

In [None]:
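# Code for analyzing scatterplot relationships 
# NOTE: the empty x/y values and "target" are placeholders; fill them in once the target is chosen
sns.scatterplot(x="", y="", hue="target", data=flight_data_df2, palette="Dark2", s=80)
plt.title("Relationship between 1, 2 and target")
plt.show()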

# Explore Attributes and Class
Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).
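
The target has not been chosen yet, so the sketch below simply assumes `ArrDel15` as a hypothetical binary class and uses made-up values. It shows two quick checks: the correlation of each numeric feature with the class, and the per-class feature means.

```python
import pandas as pd

# Hypothetical rows; ArrDel15 as the class is an assumption, not a decision
sample = pd.DataFrame({
    "DepDelay": [-10.0, -4.0, 30.0, 45.0, 0.0, 60.0],
    "TaxiIn": [8.0, 9.0, 14.0, 4.0, 8.0, 10.0],
    "ArrDel15": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

target = "ArrDel15"
features = sample.drop(columns=[target])

# Pearson correlation of each feature with the binary class, strongest first
class_corr = features.corrwith(sample[target]).sort_values(ascending=False)

# Average feature value within each class
class_means = sample.groupby(target).mean()

print(class_corr)
print(class_means)
```

A feature that is only known after the flight has departed may correlate strongly with the class yet be unavailable at prediction time, so strong correlations here need that sanity check.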

# New Features
Are there other features that could be added to the data or created from existing features? Which ones?
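
A few candidate features can be derived from columns visible in the `head()` output. The sketch below is an illustration on hypothetical rows: it assumes `CRSDepTime` is an `hhmm` integer, and the new column names (`Month`, `DayOfWeek`, `IsWeekend`, `SchedDepHour`, `DepartedLate`) are made up here. The full dataset may already provide some of these, so check before adding them.

```python
import pandas as pd

# Hypothetical rows using column names from the flight data
sample = pd.DataFrame({
    "FlightDate": ["2022-04-04", "2022-04-09", "2022-12-24"],
    "CRSDepTime": [1133, 732, 5],
    "DepDelayMinutes": [0.0, 20.0, 5.0],
})

flight_date = pd.to_datetime(sample["FlightDate"])

features = sample.assign(
    Month=flight_date.dt.month,
    DayOfWeek=flight_date.dt.dayofweek,           # Monday=0 ... Sunday=6
    IsWeekend=flight_date.dt.dayofweek >= 5,
    SchedDepHour=sample["CRSDepTime"] // 100,     # assumes hhmm integers, e.g. 1133 -> 11
    DepartedLate=sample["DepDelayMinutes"] > 0,
)

print(features)
```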

# Modeling

# Exceptional Work
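
The rubric suggests dimensionality reduction as one idea. A minimal sketch of principal component analysis using only NumPy's SVD is shown below on synthetic data; the matrix `X` is hypothetical, and on the real data it would be replaced by the standardized numeric columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric matrix: 200 rows, 3 features, the first two strongly correlated
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Standardize each column so that no feature dominates because of its units
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two components for a 2-D scatter plot
scores = X_std @ Vt[:2].T

print(explained_variance_ratio)
```

The two columns of `scores` can then be passed to a scatter plot, colored by whatever class is eventually chosen, to see whether the classes separate in the reduced space.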