# Week 6 - Exploratory Data Analysis (EDA) with Python

## Introduction to EDA
Exploratory Data Analysis (EDA) is a critical step in the data analysis process: it summarizes a dataset's main characteristics, often with statistical graphics and other data visualization methods. It helps analysts uncover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations.

## Objectives:
- Understand the principles of Exploratory Data Analysis.
- Learn to conduct a basic EDA using Python.
- Become familiar with Python libraries such as pandas, NumPy, and Matplotlib for data analysis.

## Topics Covered:
- Data Ingestion
- Data Cleaning
- Univariate Analysis
- Bivariate and Multivariate Analysis
- Data Transformation and Feature Engineering
- Outlier Detection
- Use of Statistical Methods
- Data Visualization

## Activities:

### Data Ingestion and Cleaning:
```python
import pandas as pd

# Loading the dataset
df = pd.read_csv('dataset.csv')
```

## Data Profiling

In [14]:
import pandas as pd

In [15]:
df = pd.read_csv('./Data Sets/data set 5~alt_fuel_stations(Nov-3-2023).csv', low_memory=False)

In [16]:
df

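| | Fuel Type Code | Station Name | Street Address | Intersection Directions | City | State | ZIP | Plus4 | Station Phone | Status Code | ... | Restricted Access | RD Blends | RD Blends (French) | RD Blended with Biodiesel | RD Maximum Biodiesel Level | NPS Unit Name | CNG Station Sells Renewable Natural Gas | LNG Station Sells Renewable Natural Gas | Maximum Vehicle Class | EV Workplace Charging |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CNG | Spire - Montgomery Operations Center | 2951 Chestnut St | NaN | Montgomery | AL | 36107 | NaN | NaN | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | False | NaN | MD | NaN |
| 1 | CNG | Metropolitan Atlanta Rapid Transit Authority | 2424 Piedmont Rd NE | NaN | Atlanta | GA | 30324 | NaN | NaN | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | LD | NaN |
| 2 | CNG | United Parcel Service | 270 Marvin Miller Dr | NaN | Atlanta | GA | 30336 | NaN | NaN | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | HD | NaN |
| 3 | CNG | Arkansas Oklahoma Gas Corp | 2100 S Waldron Rd | NaN | Fort Smith | AR | 72903 | NaN | 479-783-3181 | E | ... | False | NaN | NaN | NaN | NaN | NaN | False | NaN | MD | NaN |
| 4 | CNG | Clean Energy - Logan International Airport | 1000 Cottage St Ext | From Route 1, take the first exit after Callah... | East Boston | MA | 2128 | NaN | 866-809-4869 | E | ... | False | NaN | NaN | NaN | NaN | NaN | True | NaN | MD | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78368 | ELEC | 12102 S. Elk Creek Road (US-45Y-J5H-1) | 12102 S. Elk Creek Road | NaN | Pine | CO | 80470 | NaN | 866-576-1495 | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 78369 | ELEC | Hecht Warehouse Garage | 1474 Okie St. NE. | NaN | Washington | DC | 20002 | NaN | 855-223-5351 | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 78370 | ELEC | Tower Square Garage | 1500 Main Street | NaN | Springfield | MA | 1115 | NaN | 888-356-8911 | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 78371 | ELEC | The House by the Side of the Road | 370 Gibbons Hwy | NaN | Wilton | NH | 3086 | NaN | 888-356-8911 | E | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |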


In [17]:
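# Note: without parentheses, this prints the bound method (and with it the
# DataFrame's repr) instead of summary statistics. Call df.describe() to get the summary.
print(df.describe)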


<bound method NDFrame.describe of       Fuel Type Code                                  Station Name  \
0                CNG          Spire - Montgomery Operations Center   
1                CNG  Metropolitan Atlanta Rapid Transit Authority   
2                CNG                         United Parcel Service   
3                CNG                    Arkansas Oklahoma Gas Corp   
4                CNG    Clean Energy - Logan International Airport   
...              ...                                           ...   
78368           ELEC        12102 S. Elk Creek Road (US-45Y-J5H-1)   
78369           ELEC                        Hecht Warehouse Garage   
78370           ELEC                           Tower Square Garage   
78371           ELEC             The House by the Side of the Road   
78372           ELEC                              Bill Luke Santan   

                Street Address  \
0             2951 Chestnut St   
1          2424 Piedmont Rd NE   
2         270 Marvin Mi

In [18]:
df.describe()

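| | Plus4 | EV Level1 EVSE Num | EV Level2 EVSE Num | EV DC Fast Count | Latitude | ID | Federal Agency ID | Intersection Directions (French) | Access Days Time (French) | BD Blends (French) | CNG Dispenser Num | CNG Total Compression Capacity | CNG Storage Capacity | EV Pricing (French) | RD Blends (French) | RD Maximum Biodiesel Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 727.0 | 58794.0 | 9152.0 | 78372.0 | 78373.0 | 1330.0 | 0.0 | 0.0 | 0.0 | 960.0 | 604.0 | 304.0 | 0.0 | 0.0 | 536.0 |
| mean | NaN | 4.441541 | 2.366024 | 4.037478 | 37.831102 | 177138.141835 | 13.543609 | NaN | NaN | NaN | 2.738542 | 912.82947 | 48026.868421 | NaN | NaN | 5.195896 |
| std | NaN | 9.527202 | 3.134686 | 4.947789 | 5.026929 | 74447.861329 | 6.132839 | NaN | NaN | NaN | 5.226803 | 1009.900558 | 56669.309084 | NaN | NaN | 1.704546 |
| min | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 17.0 | 2.0 | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | NaN | NaN | 5.0 |
| 25% | NaN | 1.0 | 2.0 | 1.0 | 34.045715 | 122520.0 | 8.0 | NaN | NaN | NaN | 1.0 | 250.0 | 30000.0 | NaN | NaN | 5.0 |
| 50% | NaN | 2.0 | 2.0 | 2.0 | 38.5207 | 181572.0 | 14.0 | NaN | NaN | NaN | 2.0 | 700.0 | 36000.0 | NaN | NaN | 5.0 |
| 75% | NaN | 3.0 | 2.0 | 6.0 | 41.537373 | 226883.0 | 16.0 | NaN | NaN | NaN | 2.0 | 1200.0 | 58421.5 | NaN | NaN | 5.0 |
| max | NaN | 121.0 | 338.0 | 84.0 | 64.852466 | 316639.0 | 29.0 | NaN | NaN | NaN | 65.0 | 8250.0 | 593136.0 | NaN | NaN | 20.0 |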


In [19]:
# Inspecting the first few rows
print(df.head())

# View data types and non-null counts for each column
print(df.info())

# Descriptive statistics
print(df.describe())

  Fuel Type Code                                  Station Name  \
0            CNG          Spire - Montgomery Operations Center   
1            CNG  Metropolitan Atlanta Rapid Transit Authority   
2            CNG                         United Parcel Service   
3            CNG                    Arkansas Oklahoma Gas Corp   
4            CNG    Clean Energy - Logan International Airport   

         Street Address                            Intersection Directions  \
0      2951 Chestnut St                                                NaN   
1   2424 Piedmont Rd NE                                                NaN   
2  270 Marvin Miller Dr                                                NaN   
3     2100 S Waldron Rd                                                NaN   
4   1000 Cottage St Ext  From Route 1, take the first exit after Callah...   

          City State    ZIP  Plus4 Station Phone Status Code  ...  \
0   Montgomery    AL  36107    NaN           NaN           E  ...

## Variable Identification

In [20]:
# Categorize variables by type
categorical = df.select_dtypes(include=['object']).columns.tolist()
numerical = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
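
Dtype alone does not always separate identifiers from true categories, so a quick per-column overview can help. The following is a minimal sketch, not part of the original notebook: the `sample` frame is made-up illustration data that only borrows column names from the dataset, and `overview` is a hypothetical variable name.

```python
import pandas as pd

# Made-up sample rows (NOT the real data); column names mirror the dataset
sample = pd.DataFrame({
    "Fuel Type Code": ["CNG", "ELEC", "ELEC", "ELEC"],
    "Station Name": ["A", "B", "C", "D"],
    "EV Level2 EVSE Num": [None, 2.0, 2.0, 4.0],
})

# One row per column: dtype, number of distinct non-null values, non-null count
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "non_null": sample.notna().sum(),
})
print(overview)
```

A text column with only a few distinct values (like `Fuel Type Code`) behaves as a category, while one where nearly every value is distinct (like `Station Name`) is closer to an identifier and is rarely useful in a bar chart.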

## Data Cleaning

In [21]:
# Flag missing values: True where a value is NaN, False otherwise (nothing is modified)
df.isna()

| | Fuel Type Code | Station Name | Street Address | Intersection Directions | City | State | ZIP | Plus4 | Station Phone | Status Code | ... | Restricted Access | RD Blends | RD Blends (French) | RD Blended with Biodiesel | RD Maximum Biodiesel Level | NPS Unit Name | CNG Station Sells Renewable Natural Gas | LNG Station Sells Renewable Natural Gas | Maximum Vehicle Class | EV Workplace Charging |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | False | False | True | True | False | ... | True | True | True | True | True | True | False | True | False | True |
| 1 | False | False | False | True | False | False | False | True | True | False | ... | True | True | True | True | True | True | True | True | False | True |
| 2 | False | False | False | True | False | False | False | True | True | False | ... | True | True | True | True | True | True | True | True | False | True |
| 3 | False | False | False | True | False | False | False | True | False | False | ... | False | True | True | True | True | True | False | True | False | True |
| 4 | False | False | False | False | False | False | False | True | False | False | ... | False | True | True | True | True | True | False | True | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78368 | False | False | False | True | False | False | False | True | False | False | ... | True | True | True | True | True | True | True | True | True | False |
| 78369 | False | False | False | True | False | False | False | True | False | False | ... | True | True | True | True | True | True | True | True | True | False |
| 78370 | False | False | False | True | False | False | False | True | False | False | ... | True | True | True | True | True | True | True | True | True | False |
| 78371 | False | False | False | True | False | False | False | True | False | False | ... | True | True | True | True | True | True | True | True | True | False |


In [22]:
df.isna().sum()

Fuel Type Code                                 0
Station Name                                   1
Street Address                                41
Intersection Directions                    73513
City                                           4
                                           ...  
NPS Unit Name                              78187
CNG Station Sells Renewable Natural Gas    77534
LNG Station Sells Renewable Natural Gas    78305
Maximum Vehicle Class                      59832
EV Workplace Charging                      11299
Length: 74, dtype: int64
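
Raw counts are easier to act on when paired with percentages and sorted. The sketch below is one possible way to do that; `missing_summary` is a hypothetical helper name, and the `sample` frame is made-up data that only reuses column names from the dataset.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up sample rows (NOT the real data)
sample = pd.DataFrame({
    "Fuel Type Code": ["CNG", "ELEC", "ELEC", "ELEC"],
    "City": ["Atlanta", None, "Pine", "Wilton"],
    "Plus4": [None, None, None, None],
})
report = missing_summary(sample)
print(report)
```

Columns that are missing in every row (as `Plus4` is in the `describe()` output above, with a count of 0) are candidates for dropping outright, whereas columns with only a handful of gaps may be better handled row by row.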

In [23]:
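# Handling missing values
# Note: a bare df.dropna() would remove every row here, because some columns
# (e.g. Plus4) are entirely empty, so restrict the check to key fields.
df.dropna(subset=['Station Name', 'City'], inplace=True)  # Drop rows missing key fields
df['EV Level2 EVSE Num'] = df['EV Level2 EVSE Num'].fillna(0)  # Assumption: a missing port count means zero ports

# Correcting data types
df['EV Level2 EVSE Num'] = df['EV Level2 EVSE Num'].astype('int')  # Convert column to integer type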
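
Type correction also matters for codes that only look numeric. In the table output above, ZIP codes for Massachusetts and New Hampshire appear as `2128`, `1115`, and `3086`, which suggests the leading zero was lost. A minimal sketch of restoring it, assuming the affected values are integers (missing values or floats would need handling first); the `sample` frame is illustrative only:

```python
import pandas as pd

# Illustrative rows using ZIP values that appear in the output above
sample = pd.DataFrame({
    "City": ["East Boston", "Springfield", "Montgomery"],
    "ZIP": [2128, 1115, 36107],
})

# Treat ZIP as text and left-pad to five characters
sample["ZIP"] = sample["ZIP"].astype(str).str.zfill(5)
print(sample["ZIP"].tolist())  # ['02128', '01115', '36107']
```

Passing `dtype={'ZIP': str}` to `pd.read_csv` is another option that avoids the problem at load time.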

## Univariate Analysis

In [None]:
import matplotlib.pyplot as plt

# Histogram for numerical data (number of Level 2 EV ports per station)
df['EV Level2 EVSE Num'].hist(bins=50)
plt.show()

# Bar chart for categorical data (station count per fuel type)
df['Fuel Type Code'].value_counts().plot(kind='bar')
plt.show()

In [None]:
# Class Project Objectives:
# 1. Determine the current state and potential growth of the EV market in the U.S.:
#     - Is investment in EVs increasing?
#     - How does the U.S. compare to the rest of the world?
# 2. Identify key areas or regions where EV charging infrastructure is lacking or in high demand:
#     - Which state to invest in? Why?
#     - Which state to AVOID investing in? Why?
#     - Which city to invest in? Why?
#     - Which city to AVOID investing in? Why?
#     - Other relevant geographical features
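
One possible starting point for the regional questions is a per-state count of electric stations. This is only a sketch under stated assumptions: the `sample` frame is made-up data reusing the dataset's column names, and `by_state`, `stations`, and `dc_fast_ports` are hypothetical names.

```python
import pandas as pd

# Made-up sample rows (NOT the real data)
sample = pd.DataFrame({
    "Fuel Type Code": ["ELEC", "ELEC", "CNG", "ELEC", "ELEC"],
    "State": ["CO", "MA", "GA", "MA", "NH"],
    "EV DC Fast Count": [2, None, None, 4, 1],
})

# Keep electric stations only, then count stations and sum fast-charging ports per state
ev = sample[sample["Fuel Type Code"] == "ELEC"]
by_state = (
    ev.groupby("State")
      .agg(stations=("Fuel Type Code", "size"),
           dc_fast_ports=("EV DC Fast Count", "sum"))
      .sort_values("stations", ascending=False)
)
print(by_state)
```

Raw counts favor populous states, so judging where infrastructure is "lacking" would likely need a denominator such as population or registered EVs, which this dataset does not contain.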

## Bivariate/Multivariate Analysis

In [None]:
# Scatter plot for numerical variable relationships
plt.scatter(df['EV Level2 EVSE Num'], df['EV DC Fast Count'])
plt.xlabel('EV Level2 EVSE Num')
plt.ylabel('EV DC Fast Count')
plt.show()

# Correlation matrix
correlation_matrix = df[numerical].corr()
print(correlation_matrix)

## Handling Outliers

In [None]:
# Box plot to visualize outliers
df.boxplot(column=['EV Level2 EVSE Num'])
plt.show()
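
A box plot shows outliers visually; to list them, the same 1.5 × IQR rule that box plots conventionally use for their whiskers can be applied directly. The sketch below assumes that rule is appropriate for the column in question; `iqr_outliers` is a hypothetical helper, and the `ports` values are made up (338 echoes the maximum reported by `describe()` for `EV Level2 EVSE Num`).

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up port counts (NOT the real data)
ports = pd.Series([1, 1, 2, 2, 3, 4, 6, 338], name="EV Level2 EVSE Num")
mask = iqr_outliers(ports)
print(ports[mask])
```

Flagged values are not necessarily errors: a station with hundreds of ports may be real, so it is usually worth inspecting the rows before removing or capping anything.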

## Feature Engineering

In [None]:
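# Create a new feature: CNG storage capacity per dispenser
# (zero dispenser counts become NaN to avoid dividing by zero)
dispensers = df['CNG Dispenser Num'].replace(0, float('nan'))
df['CNG Storage per Dispenser'] = df['CNG Storage Capacity'] / dispensers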

## Data Transformation

In [None]:
import numpy as np

# Log transformation; log1p(x) computes log(1 + x), so zero counts are handled safely
df['log_transformed'] = np.log1p(df['EV Level2 EVSE Num'])
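
To check whether a log transform actually helps, the skewness before and after can be compared. A small sketch on made-up values (not the real data), using `Series.skew()`:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed port counts (NOT the real data)
ports = pd.Series([1, 1, 2, 2, 3, 4, 6, 338])
logged = np.log1p(ports)

# A smaller skewness after the transform means the long right tail was compressed
print("skew before:", ports.skew())
print("skew after: ", logged.skew())
```

With a single extreme value like this, the transform reduces skew but does not remove it, which is a reminder that a log transform complements outlier handling and does not replace it.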

## Correlation Analysis

In [None]:
# Heatmap of correlation matrix
import seaborn as sns

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

## Documentation & Iterative Analysis

Document every step and finding; this is crucial for reproducibility and for communicating results to others. Treat the analysis as iterative, refining techniques as insights emerge.

## Conclusion

These techniques and visualizations form the backbone of EDA in Python. They enable the analyst to understand the data's structure, relationships, and patterns before proceeding to more complex analyses or building predictive models.