# 1) Data Exploration
Different ways to visualise and understand data.




## a) Sampling and Plotting
In this section you will learn the basics of sampling and plotting using Python libraries.
- Generate 1000 samples from a normal distribution with a mean of 5 and standard deviation of 1 using NumPy. Visualise the distribution using a histogram in Matplotlib.


In [None]:
import numpy as np #This library helps us generate and process sample data.
import matplotlib.pyplot as plt #This library is used to visualise data.


In [None]:
norm = np.random.normal(5,1,1000)
print(norm)

In [None]:
plt.hist(norm)
plt.xlabel("Values")
plt.ylabel("Count") #plt.hist plots counts per bin unless density=True is passed
plt.show()
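
Because the samples are random, every run of the cells above gives slightly different numbers. A minimal sketch of how to make the draw repeatable, assuming NumPy's `default_rng` generator with an arbitrary seed of 42 (the seed and the variable names are illustrative, not part of the exercise):

```python
import numpy as np

# A seeded generator always produces the same sequence of samples
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=5, scale=1, size=1000)

# The sample mean and standard deviation should be close to 5 and 1
print(samples.mean(), samples.std())

# Re-creating the generator with the same seed reproduces the samples exactly
again = np.random.default_rng(seed=42).normal(loc=5, scale=1, size=1000)
print(np.array_equal(samples, again))
```
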

- Repeat for an exponential distribution with an average rate of 3 and visualise it as a density graph.

In [None]:
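expo = np.random.exponential(scale=1/3, size=1000) #a rate of 3 means scale = 1/rate
print(expo)

In [None]:
count, binedge = np.histogram(expo, bins=30)
pdf = count/(count.sum()*np.diff(binedge)) #share of samples per unit width
bincentre = (binedge[:-1] + binedge[1:])/2 #midpoint of each bin
print(count) #number of values in each bin
print(binedge) #border values of bin
print(pdf) #estimated probability density in each bin
plt.plot(bincentre, pdf)
plt.xlabel("Values")
plt.ylabel("Density")
plt.show()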
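
Bin counts can be normalised in two ways, and the two are easy to confuse. Dividing by the total count gives the probability of landing in each bin (the values sum to 1). Dividing that again by the bin width gives a density (the area under the curve is 1). The sketch below compares both on a seeded exponential sample; the seed and variable names are assumptions made for illustration.

```python
import numpy as np

# Illustrative sample: 1000 draws from an exponential distribution with rate 3
rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=1/3, size=1000)

counts, edges = np.histogram(samples, bins=30)
widths = np.diff(edges)                  # width of each bin
centres = (edges[:-1] + edges[1:]) / 2   # midpoint of each bin, useful as x-values

prob_per_bin = counts / counts.sum()     # probabilities: these sum to 1
density = prob_per_bin / widths          # density: area under the curve is 1

# np.histogram returns the same density directly when density=True
density_np, _ = np.histogram(samples, bins=30, density=True)

print(prob_per_bin.sum())
print((density * widths).sum())
print(np.allclose(density, density_np))
```
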

### Practice Problems:
- Generate 1000 samples of a Uniform Distribution between 0 and 10. Visualise using a histogram.


- Generate a beta distribution with shape parameters alpha=2 and beta=5. Plot the probability density function (PDF) and cumulative distribution function (CDF).

## b) Filtering & Plotting from a Data Frame
In this section you will learn how to filter and plot data from a given dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd #This library is used to read dataframes (spreadsheets)

- Read and display the given dataset Star_Properties.csv.

In [None]:
sp = pd.read_csv("Star_Properties.csv")
sp

- Filter the data to display only stars of spectral class "O". Visualise the data as a histogram of Star Type.

In [None]:
o_stars = sp.loc[sp['Spectral Class'] == 'O'] #renamed from "os" to avoid shadowing Python's os module
plt.hist(o_stars['Star type'])
plt.xlabel("Star Type")
plt.ylabel("Count")
plt.show()
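
Filtering with `.loc` relies on a boolean mask: a comparison on a column returns True or False for every row, and only the True rows are kept. A small sketch of the idea follows. The table is made up for illustration; its column names mimic Star_Properties.csv but the values are not taken from it.

```python
import pandas as pd

# Hypothetical miniature table (values invented for this example)
stars = pd.DataFrame({
    "Temperature (K)": [3000, 35000, 9000, 28000, 3200],
    "Spectral Class": ["M", "O", "A", "O", "M"],
    "Star type": [0, 4, 2, 3, 1],
})

# A comparison returns a boolean Series (the mask)
is_o = stars["Spectral Class"] == "O"

# .loc keeps only the rows where the mask is True
o_only = stars.loc[is_o]

# Masks combine with & (and), | (or) and ~ (not);
# each condition needs its own parentheses
hot_o = stars.loc[is_o & (stars["Temperature (K)"] > 30000)]

print(o_only)
print(hot_o)
```
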

- Plot all the stars in a Temperature vs Luminosity scatter plot.

In [None]:
lum = sp['Luminosity(L/Lo)']
temp = sp['Temperature (K)']

plt.scatter(temp, lum, s=100, alpha=0.6, edgecolor='black', linewidth=1)
plt.xlabel('Temperature (K)')
plt.ylabel('Luminosity(L/Lo)')
plt.show()

### Practice Problems
- Filter the data to display only stars whose Color is Red. Visualise this data as a histogram of temperature.


- Plot the stars with a Temperature in (5000,30000) and Luminosity in (0,0.01) in a Temperature vs Luminosity scatter plot.                                          
Hint: Luminosity spans many orders of magnitude, so plot it on a logarithmic scale and adjust the limits accordingly.

- (Hard) Bonus: The first histogram (Star Type vs Count) is not a good representation of the data. Can you find a way to present the data in an easier format?  
Hint: You need to plot a different type of graph.

# 2) Cleaning Data


In this section you will learn how to clean data before using it for ML applications.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## a) NULL Values
- Import the dataset diabetes.xlsx and display it.


In [None]:
db = pd.read_excel("diabetes.xlsx")
db

- Display the columns with missing values. In this dataset a missing value is stored as 0, so we count the zeros in each column.

In [None]:
# Identify missing values
for i in db.columns:
  counts = db[i].value_counts()
  if 0 in counts:  # Check if 0 is in the index
    print(f"Column '{i}' has {counts[0]} values at 0.")

Column 'Glucose' has 5 values at 0.
Column 'BloodPressure' has 35 values at 0.
Column 'SkinThickness' has 227 values at 0.
Column 'Insulin' has 374 values at 0.
Column 'BMI' has 11 values at 0.
Column 'Outcome' has 500 values at 0.


- Deal with the missing values for:

i.  Blood Pressure, Insulin and BMI (using median imputation).

ii. Skin Thickness and Glucose (using forward filling).

In [None]:
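#method 1: imputation with the column median
for cl in ['BloodPressure', 'Insulin', 'BMI']:
  m=db[cl].median()
  db[cl] = db[cl].replace(0, m) #assign back; inplace on a selected column is deprecated
#method 2: forward filling (copy the previous valid value)
for cl in ['Glucose','SkinThickness']:
  db[cl] = db[cl].replace(0, np.nan).ffill() #replace(..., method='ffill') is deprecated
db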
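
To see what the two strategies do to individual values, here is a small sketch on made-up readings (the numbers are invented, not taken from diabetes.xlsx). It first marks the zeros as `NaN` so that pandas treats them as missing. One difference from the cell above: the median here is computed over the non-missing values only, which is one possible choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up readings in which 0 stands for "not measured"
toy = pd.DataFrame({
    "BMI": [22.0, 0.0, 30.0, 26.0, 0.0],
    "Glucose": [90, 0, 110, 0, 0],
})

# Step 1: mark the zeros as missing so pandas can count them
marked = toy.replace(0, np.nan)
print(marked.isna().sum())

# Step 2a: median imputation (median() skips NaN by default)
bmi_filled = marked["BMI"].fillna(marked["BMI"].median())

# Step 2b: forward fill copies the last valid value downwards
glucose_filled = marked["Glucose"].ffill()

print(bmi_filled.tolist())
print(glucose_filled.tolist())
```

Forward filling cannot repair a missing value in the very first row, because there is no earlier value to copy.
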

- Check for missing values again. Only Outcome should still report zeros, since it was not one of the columns filled above.

In [None]:
for i in db.columns:
  counts = db[i].value_counts()
  if 0 in counts:  # Check if 0 is in the index
    print(f"Column '{i}' has {counts[0]} values at 0.")

Column 'Outcome' has 500 values at 0.


### Practice Problems
- Import the dataset titanic-train.csv
- Display it and handle the missing values in the dataset.





PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          691
Embarked         2
dtype: int64


## b) Outliers
- Import the diamonds dataset from seaborn.
- Visualise the data as a scatter plot.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Scatter plot of carat vs price
plt.scatter(diamonds['carat'], diamonds['price'])
plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Scatter Plot of Carat vs Price')
plt.show()


- Identify the outliers in the data.
- Remove the outliers using trimming.

In [None]:
# Remove outliers using trimming
diamonds_trimmed = diamonds[(diamonds['carat'] < 3.5)]

# Scatter plot of carat vs price (trimmed)
plt.scatter(diamonds_trimmed['carat'], diamonds_trimmed['price'])
plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Scatter Plot of Carat vs Price (Trimmed)')
plt.show()
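
The 3.5 carat cutoff above is chosen by eye from the scatter plot. A common alternative, not used in this notebook, is the interquartile range (IQR) rule: treat anything more than 1.5 × IQR below the first quartile or above the third quartile as an outlier. A minimal sketch on made-up carat values:

```python
import numpy as np

# Invented carat values with two obvious extremes at the end
carat = np.array([0.3, 0.5, 0.7, 0.9, 1.0, 1.1, 1.2, 1.5, 4.0, 5.0])

# First and third quartiles, and the range between them
q1, q3 = np.percentile(carat, [25, 75])
iqr = q3 - q1

# Fences beyond which a value counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Trimming: keep only the values inside the fences
keep = (carat >= lower) & (carat <= upper)
trimmed = carat[keep]

print(lower, upper)
print(trimmed)
```

The factor of 1.5 is a convention, not a law; a stricter or looser fence may suit a given dataset better.
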

### Practice Problems:
- Identify outliers in the titanic-train.csv dataset using a boxplot.
- Handle the outliers using trimming.


## c) Duplicates

In [None]:
data = {
    'Name': ['John', 'Mary', 'John', 'David', 'Mary', 'John'],
    'Age': [25, 31, 25, 42, 31, 25],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'New York']
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)

- Given the sample dataset above, identify the number of duplicate entries.


In [None]:
print(df.duplicated().sum())

- Remove the duplicates.

In [None]:
df.drop_duplicates(inplace=True)
print("\nDataset after removing duplicates:")
print(df)
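
By default `duplicated()` compares entire rows and flags only the second and later copies. The sketch below, on a hypothetical table, shows two options that are often useful: `subset` to compare only some columns, and `keep=False` to flag every member of a duplicate group.

```python
import pandas as pd

# Hypothetical table: the two Johns share a name and age but not a city
people = pd.DataFrame({
    "Name": ["John", "Mary", "John", "David"],
    "Age": [25, 31, 25, 42],
    "City": ["New York", "Los Angeles", "Boston", "Chicago"],
})

# Full-row comparison: no row repeats exactly
print(people.duplicated().sum())

# Comparing only Name and Age flags the second John
print(people.duplicated(subset=["Name", "Age"]).sum())

# keep=False marks every row in a duplicate group, not just the repeats
print(people[people.duplicated(subset=["Name", "Age"], keep=False)])

# drop_duplicates returns a new DataFrame unless inplace=True is passed
deduped = people.drop_duplicates(subset=["Name", "Age"])
print(deduped)
```
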

### Practice Problems
- Identify duplicates in the titanic-train.csv dataset.
- Handle duplicates by dropping them.


# 3) Normalisation
Often in machine learning, you will have to deal with datasets that combine data from various sources, of different types or in different formats.

To prevent any errors in our application, we need to convert all the data into a common format.
This process is called normalisation.

In this section you will learn about two forms of normalisation.

## a) Encoding
Encoding is the process of converting non-numerical data into numerical data that can be understood by the computer.


In [None]:
from sklearn.preprocessing import OneHotEncoder
# Create a sample dataset with categorical variables
data = {'color': ['red', 'green', 'blue', 'red', 'blue', 'green'],
        'shape': ['circle', 'triangle', 'circle', 'square', 'triangle', 'square']}
df = pd.DataFrame(data)
print("Original dataset:")
df

- Given the above dataset, use one-hot encoding to convert the categorical data (circle, square, etc.) into numerical data.

In [None]:
# One-hot encode the categorical variables
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df)
encoded_df = encoder.transform(df)
print("One-hot encoded dataset:")
encoded_df

In [None]:
# Convert the encoded data back to a Pandas DataFrame
encoded_df = pd.DataFrame(encoded_df, columns=encoder.get_feature_names_out())
encoded_df
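
One-hot encoding creates one column per category and puts a 1 in the column matching each row's value. For quick exploration, pandas offers `get_dummies` as an alternative to scikit-learn's `OneHotEncoder`; the sketch below uses it on a small made-up column to show the resulting layout.

```python
import pandas as pd

# Small illustrative column of categories
shapes = pd.DataFrame({"shape": ["circle", "triangle", "circle", "square"]})

# One new column per category, named <column>_<category>
dummies = pd.get_dummies(shapes, columns=["shape"], dtype=int)
print(dummies)

# Every row has exactly one 1, in the column of its own category
print(dummies.sum(axis=1).tolist())
```

`OneHotEncoder` remains the better fit inside a machine learning workflow, because a fitted encoder can apply the same set of columns to new data.
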

### Practice Problems:
- Using the cleaned titanic-train.csv dataset, encode the data in the Embarked column.

- Use binary encoding to encode the data in the Sex column.

## b) Scaling
Different sources of data can have different ranges and minimum/maximum values. To ensure that all data is given equal importance, we need to standardize the data.

This process is called scaling.

We will be using two different types of scaling: Standard Scaling (or z-score scaling) and Min-Max Scaling.
We will look at the effect that scaling has on linear regression (explained in detail in the next section).

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load California housing dataset
california_housing = fetch_california_housing(as_frame=True)
california_housing

- Given the above dataset about houses in California, identify the target value and feature values.
- You can see that the features are numerical with different means and variations.
- Use a Standard Scaler and a Min-Max Scaler and see the effects they have on the error in a Machine Learning Model.

In [None]:
# Split the dataset into features (X) and target (y)
chx = california_housing.data
chy = california_housing.target
# Split the dataset into training and testing sets
chx_train, chx_test, chy_train, chy_test = train_test_split(chx, chy, test_size=0.2, random_state=42)

In [None]:
# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data and transform both the training and testing data
chx_train_scaled = scaler.fit_transform(chx_train)
chx_test_scaled = scaler.transform(chx_test)

In [None]:
# Create a MinMaxScaler object
min_max_scaler = MinMaxScaler()

# Fit the scaler to the training data and transform both the training and testing data
chx_train_min_max_scaled = min_max_scaler.fit_transform(chx_train)
chx_test_min_max_scaled = min_max_scaler.transform(chx_test)
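
Both scalers apply a simple formula to each feature column. Standard scaling computes z = (x - mean) / standard deviation, and min-max scaling computes (x - min) / (max - min). A minimal sketch of the two formulas in plain NumPy on a made-up column:

```python
import numpy as np

# Made-up feature column
x = np.array([2.0, 4.0, 6.0, 8.0])

# Standard (z-score) scaling: subtract the mean, divide by the standard deviation.
# x.std() is the population standard deviation, which is what StandardScaler uses.
z = (x - x.mean()) / x.std()

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1
mm = (x - x.min()) / (x.max() - x.min())

print(z, z.mean(), z.std())
print(mm)
```

In the cells above the scalers are fitted on the training data only, so the test data is transformed with the training set's mean, standard deviation, minimum and maximum.
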

# 4) Linear Regression
Now we will work with one of the most basic machine learning models, linear regression.
- Given the three versions of the California housing dataset (original, standard scaled and min-max scaled), train the model on each and calculate the error in its predictions.

In [None]:
# Create a Linear Regression model
model = LinearRegression()

In [None]:
# Train the model on the original data
model.fit(chx_train, chy_train)
chy_pred = model.predict(chx_test)
mse_original = mean_squared_error(chy_test, chy_pred)

In [None]:
# Train the model on the standard scaled data
model.fit(chx_train_scaled, chy_train)
chy_pred_scaled = model.predict(chx_test_scaled)
mse_scaled = mean_squared_error(chy_test, chy_pred_scaled)

In [None]:
# Train the model on the min-max scaled data
model.fit(chx_train_min_max_scaled, chy_train)
chy_pred_min_max_scaled = model.predict(chx_test_min_max_scaled)
mse_min_max_scaled = mean_squared_error(chy_test, chy_pred_min_max_scaled)

In [None]:
print("MSE on original data:", mse_original)
print("MSE on scaled data (StandardScaler):", mse_scaled)
print("MSE on scaled data (MinMaxScaler):", mse_min_max_scaled)

MSE on original data: 0.5558915986952444
MSE on scaled data (StandardScaler): 0.5558915986952442
MSE on scaled data (MinMaxScaler): 0.5558915986952438
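
The three errors above agree to about 15 significant digits. This is expected: ordinary least squares with an intercept gives the same predictions after any rescaling or shifting of the features, because the coefficients adjust to compensate. Scaling matters for models that are sensitive to feature magnitude, such as regularised or distance-based models. The sketch below checks the invariance on synthetic data, using `np.linalg.lstsq` instead of scikit-learn and the training error rather than a test error; the data and the helper `ols_mse` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: two features on very different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

def ols_mse(features, target):
    """Fit least squares with an intercept and return the training MSE."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return np.mean(residuals ** 2)

# Standard-scale each column by hand
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

mse_raw = ols_mse(X, y)
mse_std = ols_mse(X_std, y)
print(mse_raw, mse_std)
```
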


## Practice Problems
- Import the fish_data dataset and view it.

- Identify the features and targets.
- Clean the data as shown in previous sections.

- Create a train-test split and normalise the data.
- Run a linear regression model on the data.
- Find the MSE of your model.