# Python Tutorial: Data Exploration

Data exploration is the process of gaining insights and understanding from a dataset by analyzing its characteristics, distributions, and relationships between variables. Python offers several libraries for data exploration, including pandas, Matplotlib, and Seaborn.
                                                                                                                                                                                            

Steps:
1. Installation.
2. Load libraries.
3. Load the dataset.
4. Explore the dataset.
5. Data cleaning.
6. Data visualization.



## 1. Installation.
  
You can install the libraries used in this tutorial with pip:


In [None]:
pip install pandas numpy seaborn matplotlib scikit-learn


## 2. Load libraries.

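- Pandas: Data structures and operations for manipulating numerical tables and time series.
- NumPy: N-dimensional arrays and numerical routines that pandas builds on.
- Seaborn: High-level interface for drawing attractive and informative statistical graphics.
- Matplotlib: Object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. Seaborn draws its plots with Matplotlib.
- scikit-learn: Machine learning library. This tutorial uses only its `StandardScaler` to standardize columns.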


In [None]:
# Load libraries 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Ignore warnings
# https://docs.python.org/3/library/warnings.html
import warnings

warnings.filterwarnings('ignore')


## 3. Load the dataset.

Next, load the dataset. This tutorial uses the AAPL price history from the S&P 500 dataset, stored in `AAPL_data.csv`.


In [None]:
# Load the data from the csv
df = pd.read_csv('AAPL_data.csv')
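
As a minimal sketch of what loading looks like, the example below reads a tiny in-memory CSV with the same column layout. The rows and the name `sample_df` are made up for illustration and are not taken from `AAPL_data.csv`. It also shows the optional `parse_dates` argument of `pd.read_csv`, which converts the date column while reading.

```python
import io

import pandas as pd

# Hypothetical rows in the same column layout as the tutorial's CSV (values are made up)
csv_text = """date,open,high,low,close,volume,Name
2020-01-02,10.0,10.5,9.8,10.2,1000,AAPL
2020-01-03,10.2,10.9,10.1,10.8,1200,AAPL
2020-01-06,10.8,11.0,10.4,10.5,900,AAPL
"""

# parse_dates converts the 'date' column to datetime64 while reading
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample_df.dtypes)
print(sample_df.shape)
```

Parsing dates at load time is optional. Converting the column later with `pd.to_datetime` gives the same result.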


## 4. Explore the dataset.

The next step is to explore the dataset. 

The dataset contains the following columns:
- date
- open
- high
- low
- close
- volume
- Name


In [None]:
# Display sample rows from the dataset
df.sample(5)


In [None]:
# Total number of rows and columns
df.shape


In [None]:
# Index dtype and columns, non-null values and memory usage
# (info is a method, so it needs parentheses to print the summary)
df.info()


In [None]:
# Description of the data in the DataFrame
df.describe()


Checking the data types shows which kinds of variables the dataset contains.


In [None]:
# Columns read from a CSV are only 'category' dtype after an explicit astype('category')
category_cols = ['category']
category_lst = list(df.select_dtypes(include=category_cols).columns)
print("Total number of categorical columns: ", len(category_lst))
print("Their names are as follows: ", category_lst)


In [None]:
int64_cols = ['int64']
int64_lst = list(df.select_dtypes(include=int64_cols).columns)
print("Total number of int64 columns: ", len(int64_lst))
print("Their names are as follows: ", int64_lst)


In [None]:
float64_cols = ['float64']
float64_lst = list(df.select_dtypes(include=float64_cols).columns)
print("Total number of float64 columns: ", len(float64_lst))
print("Their names are as follows: ", float64_lst)
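
The per-dtype checks above can also be done in one pass. The sketch below uses a made-up frame named `toy` (a hypothetical name, not the tutorial's `df`) to count columns per dtype and to select all numeric columns at once with `select_dtypes(include="number")`.

```python
import pandas as pd

# Made-up frame for illustration only
toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03"],
    "close": [10.2, 10.8],
    "volume": [1000, 1200],
    "Name": ["AAPL", "AAPL"],
})

# Count how many columns have each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()

# "number" matches every integer and float dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(dtype_counts)
print(numeric_cols)
```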


## 5. Data cleaning.

- Check for missing values
- Check for duplicates
- Convert data types
- Rename columns
- Remove irrelevant columns
- Handle outliers
- Standardize data
    

In [None]:
# Check for missing values
print(df.isnull().sum())


In [None]:
# Drop rows that contain missing values
df = df.dropna()
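
To see how the two calls above behave, here is a small sketch on a made-up frame (`toy` and its values are illustrative assumptions). It also shows the `subset` argument of `dropna`, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
toy = pd.DataFrame({
    "open": [10.0, np.nan, 10.8],
    "close": [10.2, 10.8, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Drop every row that has at least one missing value
cleaned = toy.dropna()

# Drop only the rows where 'close' is missing
close_only = toy.dropna(subset=["close"])

print(missing_per_column)
print(len(cleaned), len(close_only))
```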


In [None]:
# Convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'])


In [None]:
# Check for duplicates
print(df.duplicated().sum())


In [None]:
# Remove duplicate rows, keeping the first occurrence of each
# (keep=False would drop every copy, including the original row)
df = df.drop_duplicates(keep='first')
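
The `keep` argument of `drop_duplicates` decides which copies survive. The sketch below, on a made-up frame named `toy`, contrasts `keep="first"` with `keep=False`.

```python
import pandas as pd

# Made-up frame in which the first two rows are identical
toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-02", "2020-01-03"],
    "close": [10.2, 10.2, 10.8],
})

# duplicated() flags the second and later copies of a row
n_dupes = int(toy.duplicated().sum())

# keep="first" retains one copy of each duplicated row
keep_first = toy.drop_duplicates(keep="first")

# keep=False removes every row that has a duplicate, including the first copy
keep_none = toy.drop_duplicates(keep=False)

print(n_dupes, len(keep_first), len(keep_none))
```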


In [None]:
# Rename columns
df.rename(columns={'date': 'Date', 'open': 'Open', 'high': 'High', 'low': 'Low', 'close': 'Close', 'volume': 'Volume'}, inplace=True)
df


In [None]:
# Remove irrelevant columns
#df.drop(['High', 'Low', 'Volume'], axis=1, inplace=True)
#df


In [None]:
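# Handle outliers: keep rows whose closing price lies within 1.5 * IQR of the quartiles
q1 = df['Close'].quantile(0.25)
q3 = df['Close'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# .copy() makes the filtered result an independent DataFrame for later column assignments
df = df[df['Close'].between(lower_bound, upper_bound)].copy()
df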
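
The interquartile range (IQR) rule used above can be checked by hand on a short series. This is a minimal sketch with made-up values. The names `close`, `lower_bound`, and `filtered` are illustrative.

```python
import pandas as pd

# Made-up closing prices with one obvious outlier (100.0)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

q1 = close.quantile(0.25)   # 11.25 with the default linear interpolation
q3 = close.quantile(0.75)   # 13.75
iqr = q3 - q1               # 2.5

lower_bound = q1 - 1.5 * iqr   # 7.5
upper_bound = q3 + 1.5 * iqr   # 17.5

# Keep only the values inside [lower_bound, upper_bound]
filtered = close[close.between(lower_bound, upper_bound)]

print(filtered.tolist())
```

For a price series with a strong trend, this rule can flag ordinary high or low periods as outliers, so it is worth inspecting which rows are removed.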


In [None]:
# Standardize data
scaler = StandardScaler()
df[['Open', 'High', 'Low', 'Close', 'Volume']] = scaler.fit_transform(df[['Open', 'High', 'Low', 'Close', 'Volume']])
df
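
Standardization rescales each column to mean 0 and standard deviation 1 using z = (x - mean) / std. The sketch below reproduces that formula with pandas on a made-up frame. `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` here. The names `toy` and `standardized` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Close": [10.0, 12.0, 14.0],
    "Volume": [100.0, 300.0, 200.0],
})

# z-score per column; ddof=0 gives the population standard deviation
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized)
print(standardized.mean().round(6).tolist())
print(standardized.std(ddof=0).round(6).tolist())
```

Writing the result to a separate frame, as done here, keeps the original prices available for later calculations.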



## 6. Data visualization.

The next step is to visualize the data. 


In [None]:
# Line chart of closing stock price over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Close', data=df)
plt.title('Closing Stock Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Stock Price (standardized)')
plt.show()


The closing stock prices have increased over time, with some fluctuations.


In [None]:
# Box plot of the closing stock prices by year
df['Year'] = df['Date'].dt.year
sns.boxplot(x='Year', y='Close', data=df)
plt.title('Closing Stock Prices by Year')
plt.xlabel('Year')
plt.ylabel('Closing Stock Price (standardized)')
plt.show()


The closing stock prices have generally increased over the years, with some outliers.
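
A numeric summary can back up what the box plot shows. The sketch below groups a made-up frame by year and reports the minimum, median, and maximum closing value. `toy` and `yearly` are illustrative names and the values are not from the AAPL data.

```python
import pandas as pd

# Made-up closing prices spread over two years
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-03-01", "2019-06-03", "2019-09-02",
        "2020-03-02", "2020-06-01", "2020-09-01",
    ]),
    "Close": [10.0, 12.0, 14.0, 20.0, 22.0, 30.0],
})

# Extract the year, then summarise the closing price per year
toy["Year"] = toy["Date"].dt.year
yearly = toy.groupby("Year")["Close"].agg(["min", "median", "max"])

print(yearly)
```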


In [None]:
# Create a heatmap of the correlation between stock prices
corr = df[['Open', 'High', 'Low', 'Close']].corr()
plt.figure(figsize=(8,8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Between Stock Prices')
plt.show()


The opening and closing prices have a strong positive correlation, while the low and high prices have a weaker positive correlation.
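
By default, `DataFrame.corr()` reports the Pearson correlation coefficient. As a rough sketch of what each heatmap cell contains, the example below computes it by hand for two made-up columns and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
toy = pd.DataFrame({
    "Open": [10.0, 10.4, 10.9, 10.6],
    "Close": [10.2, 10.8, 10.5, 11.0],
})

# Pearson correlation as computed by pandas
corr_pandas = toy["Open"].corr(toy["Close"])

# The same coefficient from its definition:
# sum of products of deviations / sqrt(product of sums of squared deviations)
dx = toy["Open"] - toy["Open"].mean()
dy = toy["Close"] - toy["Close"].mean()
corr_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_pandas, corr_manual)
```

Pearson correlation is unchanged by shifting or rescaling a column, so standardizing the data does not alter the heatmap.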


Visualize the distribution of the target variable, which is the closing stock price. 


In [None]:
# Histogram to visualize the distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['Close'], kde=True)
plt.title('Distribution of Closing Stock Price')
plt.xlabel('Closing Stock Price (standardized)')
plt.ylabel('Frequency')
plt.show()


To visualize the daily returns, create a line chart. The line chart will show the percentage change in price from one day to the next.


In [None]:
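# 'Close' was standardized in step 5, and percentage changes of standardized
# values are not returns, so recover the original prices first
price_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
prices = pd.DataFrame(scaler.inverse_transform(df[price_cols]), columns=price_cols, index=df.index)
daily_returns = prices['Close'].pct_change()

# Create a line chart of the daily returns (rows are assumed to be in date order)
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], daily_returns)
plt.title('AAPL Daily Returns')
plt.xlabel('Date')
plt.ylabel('Daily Return')
plt.show()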
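
A daily return is the change from the previous close divided by the previous close. The sketch below applies `pct_change` to a few made-up raw prices and compares it with the explicit formula. The calculation is only meaningful on actual prices, not on standardized values.

```python
import numpy as np
import pandas as pd

# Made-up raw closing prices in date order
close = pd.Series([100.0, 102.0, 99.96])

# pct_change: (price_t - price_{t-1}) / price_{t-1}
returns = close.pct_change()

# The same calculation written out with shift()
manual = (close - close.shift(1)) / close.shift(1)

print(returns.round(4).tolist())
```

The first value is always `NaN` because the first row has no previous price.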


We can use a combination chart to visualize the stock prices with the volume traded.


In [None]:
# Create a combination plot of closing price and volume traded
# (both columns were standardized in step 5, so they can share one y-axis)
plt.figure(figsize=(12,6))
sns.lineplot(x='Date', y='Close', data=df, color='b', label='Closing Price')
sns.lineplot(x='Date', y='Volume', data=df, color='g', alpha=0.5, label='Volume')
plt.title('AAPL Closing Price with Volume Traded')
plt.xlabel('Date')
plt.ylabel('Standardized value')
plt.legend()
plt.show()
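
When price and volume are on their original scales, one shared y-axis flattens the price line because volume is orders of magnitude larger. One common alternative, sketched below on made-up data, is a secondary y-axis created with Matplotlib's `twinx()`. The frame `toy` and the axis names `price_ax` and `volume_ax` are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up unstandardized data for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "Close": [10.2, 10.8, 10.5],
    "Volume": [1000, 1200, 900],
})

fig, price_ax = plt.subplots(figsize=(12, 6))

# Closing price on the left y-axis
price_ax.plot(toy["Date"], toy["Close"], color="b")
price_ax.set_xlabel("Date")
price_ax.set_ylabel("Closing Price")

# Volume on a second y-axis that shares the same x-axis
volume_ax = price_ax.twinx()
volume_ax.bar(toy["Date"], toy["Volume"], color="g", alpha=0.3)
volume_ax.set_ylabel("Volume")

price_ax.set_title("Closing Price with Volume Traded")
fig.tight_layout()
plt.show()
```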


In [None]:
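# Create a histogram of the daily returns
# ('Close' was standardized in step 5, so recover the original prices first)
price_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
raw_values = scaler.inverse_transform(df[price_cols])
close_prices = pd.Series(raw_values[:, price_cols.index('Close')], index=df.index)
plt.figure(figsize=(12,6))
sns.histplot(close_prices.pct_change().dropna(), bins=100, kde=True)
plt.title('Distribution of AAPL Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency')
plt.show()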


## Exercise 1: 

Load a CSV file named 'data.csv' into a DataFrame and display the first 5 rows.


In [None]:
# Solution


## Exercise 2: 

Calculate summary statistics for the DataFrame created in Exercise 1.
                                          

In [None]:
# Solution


## Exercise 3: 

Create a box plot of the 'Income' column from the DataFrame created in Exercise 1.


In [None]:
# Solution


## Summary

Data exploration is a crucial step in the data analysis process, as it helps in understanding the characteristics and patterns present in the data. Python provides powerful tools and libraries for data exploration, allowing you to gain valuable insights from your datasets efficiently.


<details>
<summary><b>Instructor Notes</b></summary>

Nothing to add...

</details>