# Data Science

### Definition:
Data Science is the process of extracting meaningful knowledge, insights, and information from data using scientific methods and resources.

### Scientific Methods/Resources:
1. **Machine Learning**: Analyzing structured data like CSV, Excel, MongoDB, SQLite.
2. **Deep Learning**: Working with unstructured data such as images.
3. **Natural Language Processing (NLP)**: Handling text data.
4. **Statistics**: Applying statistical techniques to analyze data.
5. **Data Visualization**: Tools like Seaborn, Matplotlib, and Pandas to create visual representations of data.

### Types of Data:
1. **Structured Data**: Data in a defined format, e.g., CSV, Excel.
2. **Unstructured Data**: Data without a predefined format, e.g., images, videos, text, audio, PDF.
3. **Semi-Structured Data**: Data with some structure, e.g., JSON, HTML.

---

# Machine Learning

### Definition:
Machine Learning (ML) is a subset of AI and a branch of computer science focused on programming systems to automate learning from past data/experiences. ML models improve performance and make predictions based on training data.

### Training and Testing Data:
- **Training Data**: Used to train the model (commonly about 80% of the total data).
  - Example: If the total dataset has 600 samples, an 80% training set will consist of 480 samples.
- **Testing Data**: Unseen data used to test the model's accuracy (commonly the remaining 20% of the total data).
  - Example: For a dataset of 600 samples, a 20% testing set will consist of 120 samples.
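
The 480/120 split above can be reproduced with a few lines of NumPy. This is a minimal sketch: the dataset is random placeholder data and the variable names are illustrative. In practice `train_test_split` from scikit-learn performs the same shuffling and cutting.

```python
import numpy as np

# Hypothetical dataset: 600 samples with 3 features (random placeholder values)
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(600, 3))
y = rng.normal(size=600)

# Shuffle the row indices, then cut at the 80% mark (integer arithmetic)
indices = rng.permutation(len(X))
split_at = len(X) * 8 // 10
train_idx, test_idx = indices[:split_at], indices[split_at:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 480 120
```

Shuffling before cutting matters when the rows are ordered (for example, sorted by class), since otherwise the test set would not resemble the training set.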

---

### Real-World Applications of Machine Learning:
1. **Medical Field**: Disease detection (e.g., Cancer, Covid, Diabetes).
2. **House Price Prediction**: Forecasting real estate prices.
3. **Stock Price Prediction**: Analyzing market trends.
4. **Financial Domain**: Approvals or risk analysis (e.g., high risk, medium, low).
5. **Handwritten Character Recognition**: Identifying characters in scanned documents.
6. **Sentiment Analysis and Text Classification**: Classifying sentiments (e.g., Happy/Sad, Positive/Negative) and filtering messages (e.g., Spam/Not Spam).
7. **ChatBots**: E-commerce bots like Amazon and Flipkart assistants.
8. **Weather Prediction**: Forecasting future weather conditions.
9. **Speech Recognition**: Converting spoken words into text.
10. **Advertisement**: Targeting users with personalized ads.
11. **Self-Driving Cars**: Object detection and autonomous navigation.
12. **Insurance**: Risk assessment for policies.

---

### Example - Health Insurance:
- **Cancer Insurance**: Coverage up to 5 lakh.
- **Diabetes Insurance**: Coverage up to 10 lakh.


# Linear Regression

## What is Linear Regression?
- **Linear**: Refers to a straight line or path.
- **Regression**: Refers to predicting a continuous value or a real number.

Linear regression is a predictive model used to find the linear relationship between a dependent variable and one or more independent variables.

## Types of Linear Regression

### 1. Simple Linear Regression
   - Involves only **one independent variable**.
   
### 2. Multiple Linear Regression
   - Involves **two or more independent variables**.

## Key Concepts

- The primary goal of linear regression is to find the **Best Fit Line**, which is also known as the **Regression Line**.

### Linear Model Representation
- For a simple linear regression:
  \[
  y = mx + c
  \]
  Where:
  - \( y \) = Dependent variable
  - \( x \) = Independent variable
  - \( m \) = Slope of the line
  - \( c \) = Intercept on the y-axis

- For multiple linear regression with multiple independent variables:
  \[
  y = m_1x_1 + m_2x_2 + m_3x_3 + \dots + m_Nx_N + c
  \]
  Where:
  - \( y \) = Dependent variable
  - \( x_1, x_2, \dots, x_N \) = Independent variables
  - \( m_1, m_2, \dots, m_N \) = Coefficients (slopes) of the independent variables
  - \( c \) = Intercept

## Variables:
- **Dependent Variable** (\( y \)): A continuous or numeric variable that we aim to predict.
- **Independent Variables** (\( x \)): These can be continuous/numeric or discrete variables used to predict the dependent variable.

## Objective:
The aim of linear regression is to predict the dependent variable \( y \) based on the independent variables \( x \).
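
As a small illustration of the objective, the sketch below fits \( y = mx + c \) to made-up points and uses the result to predict a new value. The data are hypothetical (generated from \( y = 2x + 1 \)), and `np.polyfit` is just one convenient way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1, so the expected m is 2 and c is 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns [slope, intercept]
m, c = np.polyfit(x, y, deg=1)
print(f"m = {m:.2f}, c = {c:.2f}")

# Predict the dependent variable for a new x using y = mx + c
x_new = 5.0
y_new = m * x_new + c
print(f"predicted y for x = {x_new}: {y_new:.2f}")
```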



# Linear Regression - Detailed Explanation

## When \( y = x \)

- **When the value of \( x \) is the same as \( y \):**
  - \( y = x \)
  - The equation becomes:
    \[
    y = 1 \times x
    \]
  - This results in a line with a slope of 1, i.e., a **45-degree line**.
  - The tangent of 45° is 1, so:
    \[
    \tan(45^\circ) = 1
    \]
  - Therefore, the equation becomes:
    \[
    y = x
    \]
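
The link between the angle of a line and its slope can be checked numerically. A minimal sketch using only the standard library (the helper name `slope_from_angle` is illustrative):

```python
import math

def slope_from_angle(degrees: float) -> float:
    """Slope m of a line that makes the given angle with the x-axis."""
    return math.tan(math.radians(degrees))

slope_45 = slope_from_angle(45)
print(round(slope_45, 6))  # 1.0, i.e. the line y = x
```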

## When \( y \neq x \)

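- **Example 1**: When \( x = 1 \), \( y = 2 \); when \( x = 2 \), \( y = 3 \). Here \( y \) rises by 1 for every unit of \( x \), so \( m = 1 \), and the line is shifted up by 1, so \( c = 1 \): \( y = x + 1 \).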
- **Example 2**: When \( x = 0 \), what is the value of \( y \)? The value of \( y \) when \( x = 0 \) is the **intercept**.

Thus, the general equation becomes:
\[
y = mx + c
\]
Where:
- \( m \) is the slope, and
- \( c \) is the intercept.

- \( m \) and \( c \) reflect the relationship between \( x \) and \( y \) in your dataset.
- The combination of \( m \) and \( c \) forms the **Best Fit Line (BFL)** or the **model**.

### Example:
- When \( \theta = 25^\circ \), the slope is the tangent of the angle: \( m = \tan(25^\circ) \approx 0.47 \).
  - **Q**: What happens when we change \( x \) by 1 unit? How does \( y \) change? It changes by \( m \) units, here about 0.47.
  - **Multiple features**: Features represent dimensions in the feature space.
  - **Feature space**: All dimensions of data in the model.

## When we have multiple independent variables:

- For two independent variables:
  \[
  y = m_1x_1 + m_2x_2 + c
  \]

### Feature Space & Model Dimensions:
- **Feature space**: All dimensions of the data.
- **Model dimensions**: One less than the feature space (since we exclude the dependent variable).

## Hyperplanes in Multi-Dimensional Space:
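- **Hyperplanes**: With two dimensions the model is a line, and with three it is a plane. With more than 3 dimensions we cannot visualize the model, but the resulting **hyperplane** keeps the same properties within its feature space.
- Linear models work in any number of dimensions (for example, 300), whereas human visualization is limited to 3 dimensions.
- The axes of the feature space (all 300 in this example) are orthogonal to each other.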

### Assumption of Independence:
- The algorithm assumes that independent variables are independent of each other and do not influence each other.
  
### Example:
- If we want to predict the price of a car (dependent variable) using weight and horse power (independent variables), in reality, **weight and horse power are related**.
  - More weight generally means more horse power.
  - However, the algorithm assumes that weight and horse power are independent of each other, which may create problems due to **collinearity**.

### Collinearity Problem:
- If there is strong collinearity (a strong relationship between independent variables), we need to resolve it.
- **Solution**: Drop one of the correlated variables, or use dimensionality reduction such as **Principal Component Analysis (PCA)**, which combines correlated dimensions into a smaller set of uncorrelated components.

## Correlation:

### What is Correlation?
- Correlation indicates how strong the linear relationship between two variables is.
- We typically use **Pearson’s correlation**, which ranges from **-1 to +1**.
  
### Types of Correlation:
- **Positive Correlation**: As \( x \) increases, \( y \) also increases (closer to +1).
  - **Good predictor** when the correlation value is closer to +1.
  
- **Negative Correlation**: As \( x \) increases, \( y \) decreases (closer to -1).
  - **Good predictor** when the correlation value is closer to -1.

- **No Linear Relationship**: When the correlation is near **0**, it indicates no linear relationship between the variables, making it a **bad predictor**.

### Coefficient of Correlation (R):
- **R** is an indication of the strength of the relationship between two variables.
- **Ideal relationship**: When \( R = 1 \) or \( R = -1 \), the regression line will pass through all points (this situation is rare).
  
### R Value Ranges:
- **Good Predictors**: \( R \) values from **+0.7 to +1** or from **-0.7 to -1**.
- **Bad Predictors**: \( R \) values from **-0.3 to +0.3**.
  
### In-depth Study of Features:
- **Features with \( R \) values between -0.7 and -0.3 or between 0.3 and 0.7** need to be studied more carefully.
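
The ranges above are rules of thumb, and they can be written down as a small helper. This is only a sketch: the function name and the returned labels are illustrative, and the thresholds are the ones stated in these notes.

```python
def predictor_strength(r: float) -> str:
    """Label a Pearson R value using the rule-of-thumb ranges from these notes."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("Pearson's R must lie between -1 and +1")
    strength = abs(r)  # the sign only gives the direction of the relationship
    if strength >= 0.7:
        return "good predictor"
    if strength <= 0.3:
        return "bad predictor"
    return "needs further study"

labels = [predictor_strength(r) for r in (0.85, -0.9, 0.1, -0.5)]
print(labels)
```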


# Class 2

# Linear Regression - Correlation and Statistical Concepts

## Correlation and R Value:
- **R value near 0**: When data points are scattered, there is no clear linear relationship.
- **R value near +1**: When \( x \) and \( y \) show a **positive linear relation**.
- **R value near -1**: When \( x \) and \( y \) show a **negative linear relation**.

## Computation of Correlation

### Variance:
- Variance measures the **variability** from the mean.
  \[
  V = \frac{\sum (X_i - \hat{X})^2}{n}
  \]
  Where:
  - \( X_i \) is each individual data point,
  - \( \hat{X} \) is the mean of \( X \),
  - \( n \) is the number of data points.

### Central Tendency:
- **Central Tendency**: The statistical measure that identifies a single value as a representation of an entire distribution.
- It aims to provide an **accurate description** of the entire data.
  - Measures of central tendency: **mean**, **mode**, and **median**.
  
- Variance provides the **reliability** of the central value.

### Covariance:
- Covariance measures how two features **vary together**: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.
  \[
  \text{Cov} = \frac{\sum (X_i - \hat{X})(Y_i - \hat{Y})}{n}
  \]
  Where:
  - \( X_i \) and \( Y_i \) are the individual data points for variables \( X \) and \( Y \),
  - \( \hat{X} \) and \( \hat{Y} \) are the means of \( X \) and \( Y \),
  - \( n \) is the number of data points.

- **If covariance is 0**, the correlation will also be 0.

### Standard Deviation:
- Standard deviation measures the spread of data points from the mean.
  \[
  \text{std} = \sqrt{\frac{\sum (X_i - \hat{X})^2}{n}}
  \]

### Pearson’s Correlation Coefficient:
- The **Pearson correlation coefficient (R)** is computed as:
  \[
  R(x, y) = \frac{\sum (X_i - \hat{X})(Y_i - \hat{Y})}{\sqrt{\sum (X_i - \hat{X})^2} \cdot \sqrt{\sum (Y_i - \hat{Y})^2}}
  \]
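
The formulas for variance, covariance, standard deviation and Pearson's R can be followed step by step on a tiny example. The data below are made up, and `np.corrcoef` is used only as a cross-check of the manual result.

```python
import numpy as np

# Hypothetical sample of five (x, y) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

x_mean, y_mean = x.mean(), y.mean()

# Variance and standard deviation (dividing by n, as in the formulas above)
variance_x = np.sum((x - x_mean) ** 2) / n
std_x = np.sqrt(variance_x)
std_y = np.sqrt(np.sum((y - y_mean) ** 2) / n)

# Covariance of x and y
covariance_xy = np.sum((x - x_mean) * (y - y_mean)) / n

# Pearson's R: covariance scaled by both standard deviations
r = covariance_xy / (std_x * std_y)
print(f"R = {r:.4f}")
print(f"np.corrcoef gives {np.corrcoef(x, y)[0, 1]:.4f}")
```

Dividing the covariance by the two standard deviations is equivalent to the summation formula above, because the factors of \( n \) cancel.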

### Linear Regression Model:
- When we build a **linear regression model**, the **Best Fit Line (BFL)** will always pass through the point where the **mean of \( x \)** and the **mean of \( y \)** meet.
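
This property is easy to verify numerically. A short sketch with made-up data, using `np.polyfit` for the least-squares line:

```python
import numpy as np

# Hypothetical sample of five (x, y) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line y = mx + c
m, c = np.polyfit(x, y, deg=1)

# Evaluate the line at the mean of x and compare with the mean of y
y_at_mean_x = m * x.mean() + c
print(y_at_mean_x, y.mean())
```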

### How R Value Approaches 0:
- Draw a vertical line at \( \hat{X} \) and a horizontal line at \( \hat{Y} \); they split the scatter plot into four quadrants. For each point we compute the product \( (X_i - \hat{X})(Y_i - \hat{Y}) \), which is positive in the upper-right and lower-left quadrants and negative in the other two.
- When the points are spread evenly over all four quadrants (a **symmetric quadrant distribution**), the positive and negative products cancel each other out when summed, making the R value approach **0**.

### How R Value Approaches +1:
- When the points fall mostly in the upper-right and lower-left quadrants (an **asymmetric quadrant distribution**), the products \( (X_i - \hat{X})(Y_i - \hat{Y}) \) are mostly positive.
- When we sum these products, they do not cancel out and the R value approaches **+1**.

### How R Value Approaches -1:
- When the points fall mostly in the upper-left and lower-right quadrants (again an **asymmetric quadrant distribution**), the products \( (X_i - \hat{X})(Y_i - \hat{Y}) \) are mostly negative.
- When we sum these products, they do not cancel out and the R value approaches **-1**.

## Statistical Fluke:
- **Statistical Fluke**: When \( |R| \) is between **0.5 and 0.7**, the apparent correlation (positive or negative) may not reflect a real relationship, so it requires **further research**.
- This can happen due to **random sampling errors**: the particular sample drawn shows a correlation by chance, which is a **statistical fluke**.
- A statistical fluke occurs because of problems in the **random sampling** of the data used to build the model.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the dataset
df = pd.read_csv('Iris.csv')

In [None]:
# Display the dataset
print("Dataset Preview:")
print(df.head())  # Display the first few rows

In [None]:

# Check the number of unique values in each column
print("\nNumber of unique values in each column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")

In [None]:
# Drop the 'Id' column as it is not relevant for analysis
df.drop('Id', axis=1, inplace=True)

In [None]:
# Display the modified dataset
print("\nModified Dataset after dropping 'Id' column:")
print(df.head())

In [None]:
# Compute the correlation matrix (numeric columns only; 'Species' is still text here)
correlation_matrix = df.corr(numeric_only=True)
print("\nCorrelation Matrix:")
print(correlation_matrix)

In [None]:
# Display the count of each species
species_counts = df['Species'].value_counts()
print("\nSpecies Counts:")
print(species_counts)

In [None]:
# Replace species names with numerical values for further analysis
species_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['Species'] = df['Species'].map(species_mapping)  # Assign back instead of using inplace on a column

In [None]:
# Display the dataset after mapping species
print("\nDataset after encoding 'Species':")
print(df.head())

In [None]:
# Compute the correlation matrix again after species encoding
correlation_matrix_updated = df.corr()
print("\nUpdated Correlation Matrix:")
print(correlation_matrix_updated)

In [None]:
# Plot pairwise relationships
sns.pairplot(df, diag_kind='kde', hue='Species', palette='Set2')  # Add KDE plots on the diagonal
plt.suptitle("Pairplot of Iris Dataset", y=1.02)  # Add a title
plt.show()

In [None]:
# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))  # Set the figure size
sns.heatmap(correlation_matrix_updated, annot=True, cmap="YlGnBu", fmt=".2f", linewidths=0.5)
plt.title("Heatmap of Correlation Matrix")  # Add a title
plt.show()

# Gradient Descent Algorithm

## 1. Best Fit Line (BFL)
The **Best Fit Line (BFL)** is a straight line that minimizes the overall error between actual data points and predicted points.  
It has the following properties:
- Lies as close as possible to the data points overall; it does not have to pass through any of them.
- Usually **cannot pass through all data points**.
- Minimizes the **distance between the points and the line**.
- Ensures the **least sum of squared errors**.
- Always passes through the point where **mean of \(x\) (\(\bar{x}\))** and **mean of \(y\) (\(\bar{y}\))** intersect.

---

## 2. Error/Residual
- **Residual/Error**: The difference between the actual value and the predicted value:  
  \[
  \text{Error (Residual)} = Y_a - Y_p
  \]  
  Where:  
  - \(Y_a\): Actual value  
  - \(Y_p\): Predicted value  

---

## 3. Key Concepts

### Finding the Best Fit Line
- The algorithm (linear model) identifies **one best fit line** from an infinite number of possibilities.
- To accomplish this, the linear model uses a process called **Gradient Descent**.

### Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS)
- Formula:  
  \[
  SSE = \sum{(Y_a - Y_p)^2}
  \]  
  Where:  
  - \(Y_a\): Actual values  
  - \(Y_p\): Predicted values  

- Dividing by the number of data points \(n\) gives the mean of the squared errors, which is the cost \(J\) minimized by gradient descent:  
  \[
  J = \frac{SSE}{n} = \frac{\sum{(Y_a - Y_p)^2}}{n}
  \]

- A **smaller error** indicates a more reliable model.

---

## 4. Quadratic Equation and Convex Function
- When we use the squared term, the equation forms a **quadratic equation**.
- When we plot the values of **m (slope)** and **c (intercept)** against the error, the plot takes a **bowl shape**:
  - This shape is called a **Convex Function**.
  - A convex function guarantees one **absolute minimum (global minimum)**.
- Each point on the "bowl" corresponds to a specific **m, c, and SSE**, representing a line.

---

## 5. Gradient Descent

- **Gradient Descent** involves descending down the gradient of the bowl (convex function) to minimize the error.

### Formula:
The parameters \(m\) (slope) and \(c\) (intercept) are updated iteratively as follows:  
\[
m = m - \alpha \frac{\partial J}{\partial m}
\]
\[
c = c - \alpha \frac{\partial J}{\partial c}
\]

Where:  
- \(J\): Cost function (the mean of the squared errors)  
- \(\frac{\partial J}{\partial m}\): Partial derivative of \(J\) with respect to \(m\)  
- \(\frac{\partial J}{\partial c}\): Partial derivative of \(J\) with respect to \(c\)  
- \(\alpha\): Learning rate (a small positive value that controls step size)  

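### Learning Step:
- The **learning step** is the amount by which \(m\) and \(c\) move in one iteration: the learning rate \(\alpha\) multiplied by the partial derivative of the error.
- With a fixed \(\alpha\), the learning step becomes **smaller and smaller** as the iterations approach the minimum, because the gradient flattens there.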

---

## 6. Key Components of Gradient Descent
1. **Partial Derivatives**:  
   - Measures the **rate of change** of the error with respect to small changes in \(m\) and \(c\).  
   - Determines how the error increases or decreases with small changes in \(m\) and \(c\).

2. The algorithm uses partial derivatives internally to optimize the values of \(m\) and \(c\), ensuring that the **error decreases**:  
   \[
   \frac{\partial J}{\partial m} = \frac{-2}{n} \sum (Y_a - Y_p) x
   \]
   \[
   \frac{\partial J}{\partial c} = \frac{-2}{n} \sum (Y_a - Y_p)
   \]
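
The update rule and the two partial derivatives above translate almost line for line into code. The following is a minimal sketch, not a production implementation: the data are hypothetical (generated from \( y = 2x + 1 \)), and the learning rate and iteration count are assumptions chosen so that this small example converges.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Fit y = m*x + c by minimizing the mean of the squared errors."""
    m, c = 0.0, 0.0  # arbitrary starting point on the "bowl"
    n = len(x)
    for _ in range(iterations):
        y_pred = m * x + c
        error = y - y_pred                       # Y_a - Y_p
        dJ_dm = (-2.0 / n) * np.sum(error * x)   # partial derivative w.r.t. m
        dJ_dc = (-2.0 / n) * np.sum(error)       # partial derivative w.r.t. c
        m -= alpha * dJ_dm                       # step against the gradient
        c -= alpha * dJ_dc
    return m, c

# Hypothetical data from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, c = gradient_descent(x, y)
print(f"m = {m:.4f}, c = {c:.4f}")
```

If \(\alpha\) is set too large the updates overshoot the minimum and the error grows instead of shrinking, which is why a small positive value is used.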

---

## 7. Errors:
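- **Residual error** (squared and summed over all points, this gives SSE/RSS):  
  \[
  \text{Residual} = \text{Actual} - \text{Predicted}
  \]

- **Regression error** (squared and summed over all points, this gives SSR):  
  \[
  \text{Regression} = \text{Predicted} - \text{Mean}
  \]

- **Total error** (squared and summed over all points, this gives SST):  
  \[
  \text{Total} = \text{Residual} + \text{Regression} = \text{Actual} - \text{Mean}
  \]

- For a least-squares fit with an intercept, the sums of squares add up in the same way: \(SST = SSE + SSR\).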

---

## 8. Coefficient of Determination (\(R^2\))

- **Purpose**: To evaluate the model's performance and check its reliability.

### Key Points:
1. **If we have one independent variable**:
   - Called the **Coefficient of Determination**.
   - Equal to the square of the correlation coefficient, \(R \times R\).
2. **If we have more than one independent variable**:
   - Called the **Coefficient of Multiple Determination**.

- Indicates how much of the **total variance** in \(Y\) is explained by the model.
- \(R^2\) normally ranges between **0 and 1**:
  - A value closer to **1** indicates higher accuracy.
  - \(R^2 = 1\): Model explains 100% of variance, \(SSE = 0\).
  - \(R^2 = 0\): Model explains none of the variance (no better than always predicting the mean).
  - \(R^2 < 0\): Possible on unseen data when the model predicts worse than the mean; indicates poor model performance.

### Formula:
\[
R^2 = 1 - \frac{SSE}{SST}
\]
Where:
\[
SSE = \sum (\text{Actual} - \text{Predicted})^2
\]
\[
SST = \sum (\text{Actual} - \text{Mean})^2
\]
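
A short numerical check of the formula, using made-up data and a least-squares line from `np.polyfit`. The sketch also computes SSR to show how the three sums of squares relate.

```python
import numpy as np

# Hypothetical sample of five (x, y) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line and its predictions
m, c = np.polyfit(x, y, deg=1)
y_pred = m * x + c

sse = np.sum((y - y_pred) ** 2)         # residual sum of squares
ssr = np.sum((y_pred - y.mean()) ** 2)  # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares

r2 = 1 - sse / sst
print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}, R^2 = {r2:.2f}")
```

For this sample the line explains 60% of the variance in \(y\), and with a single independent variable \(R^2\) equals the square of Pearson's R.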

### Example:
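- Adding variables affects \(R^2\) as follows:
  - Good predictor → \(R^2\) increases.
  - Poor predictor → \(R^2\) on the training data stays the same or increases only slightly (it never decreases for a least-squares fit), so a rising \(R^2\) alone does not show that the new variable is useful.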

---

## 9. Adjusted \(R^2\)
- Adjusted \(R^2\) accounts for the number of predictors and penalizes unnecessary complexity.
- A better metric when comparing models with varying numbers of predictors.


# **Adjusted R-Squared and Related Concepts**

## **1. Adjusted R-Squared (A-R²)**

### **What is Adjusted R²?**
- A modified version of R² that accounts for the number of independent variables in the model.
- Increases only when a good predictor is added.
- Always less than or equal to R².

### **Formula**  
\[
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
\]  
Where:  
- \(n\): Sample size  
- \(k\): Number of independent variables  
- \(R^2\): Coefficient of determination  

### **Example Calculation**  
For \(n = 100\), \(k = 8\), and \(R^2 = 0.93\):  
\[
R^2_{\text{adj}} = 1 - \frac{(1 - 0.93)(100 - 1)}{100 - 8 - 1} = 1 - \frac{6.93}{91} \approx 0.9238
\]
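
The same calculation as a small function, so that other values of \(n\), \(k\) and \(R^2\) can be tried. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n samples and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

value = adjusted_r2(0.93, n=100, k=8)
print(round(value, 4))  # 0.9238

# The penalty grows with the number of predictors for the same R-squared
print(round(adjusted_r2(0.93, n=100, k=30), 4))
```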

---

## **2. Assumptions of Linear Regression**

1. **Linearity:**  
   The relationship between dependent and independent variables is linear.

2. **Independence:**  
   Observations are independent of each other, so the error (residual) of one observation does not depend on another.

3. **Normality of Errors:**  
   Errors (residuals) should follow a normal distribution.

4. **Homoscedasticity:**  
   Errors should have constant variance across all values of predictors.

5. **No Multicollinearity:**  
   Independent variables should not have high correlation with each other.

6. **Mean of Residuals:**  
   The mean of residuals should be zero.

---

## **3. Variance Inflation Factor (VIF)**

### **What is VIF?**
- Detects multicollinearity among independent variables.

### **VIF Ranges**
- \(1\): Not correlated  
- \(1-5\): Moderately correlated  
- \(>5\): Highly correlated  
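
VIF for a feature is commonly computed as \( 1 / (1 - R_j^2) \), where \( R_j^2 \) comes from regressing that feature on all the other features. The sketch below does this with NumPy on synthetic data; the feature names (`weight`, `horsepower`, `doors`) and their values are made up for illustration, and libraries such as statsmodels offer a ready-made equivalent.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor for each column of X."""
    n_samples, n_features = X.shape
    scores = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features (with an intercept)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1 - (residual @ residual) / np.sum((target - target.mean()) ** 2)
        scores[j] = 1.0 / (1.0 - r2)
    return scores

# Synthetic features: horsepower is built from weight, doors is unrelated
rng = np.random.default_rng(0)
weight = rng.normal(size=200)
horsepower = 0.9 * weight + 0.1 * rng.normal(size=200)
doors = rng.normal(size=200)

scores = vif(np.column_stack([weight, horsepower, doors]))
print(np.round(scores, 2))
```

The two related features receive a VIF far above 5, while the unrelated one stays close to 1.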

---

## **4. Ordinary Least Squares (OLS)**

### **Definition**
- A statistical method to estimate the relationship between variables by minimizing the sum of squared differences between actual and predicted values.
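
A minimal sketch of OLS for two independent variables, solved directly with `np.linalg.lstsq` rather than iteratively. The data are synthetic and noise-free (generated from \( y = 3x_1 - 2x_2 + 5 \)), so the recovered coefficients should match those values.

```python
import numpy as np

# Synthetic data: y = 3*x1 - 2*x2 + 5 with no noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# Append a column of ones so the last coefficient acts as the intercept c
design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution: minimizes the sum of squared (actual - predicted)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
m1, m2, c = coef
print(f"m1 = {m1:.2f}, m2 = {m2:.2f}, c = {c:.2f}")
```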

---

## **5. Stochastic Gradient Descent (SGD)**

### **What is SGD?**
- A faster variant of gradient descent, especially for large datasets.  
- Updates the parameters using one randomly chosen sample (or a small random batch) per step instead of the entire dataset.

### **Why Use SGD?**
- Efficient for high-dimensional data.
- Faster for large datasets.
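
The difference from ordinary gradient descent is that each update looks at a single sample. A minimal sketch under stated assumptions: the data are hypothetical and noise-free (from \( y = 2x + 1 \)), and the learning rate, epoch count and seed are illustrative choices.

```python
import numpy as np

def sgd(x, y, alpha=0.01, epochs=200, seed=0):
    """Fit y = m*x + c, updating m and c after every single sample."""
    rng = np.random.default_rng(seed)
    m, c = 0.0, 0.0
    for _ in range(epochs):
        # Visit the samples in a new random order each epoch
        for i in rng.permutation(len(x)):
            error = y[i] - (m * x[i] + c)
            # Gradient of the squared error of this one sample
            m += alpha * 2.0 * error * x[i]
            c += alpha * 2.0 * error
    return m, c

x = np.linspace(0.0, 4.0, 50)
y = 2.0 * x + 1.0

m, c = sgd(x, y)
print(f"m = {m:.4f}, c = {c:.4f}")
```

With noisy data the estimates would keep fluctuating around the minimum, which is why the learning rate is often reduced over time in practice.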

---

## **6. Evaluation Metrics for Regression**

### **a) Mean Absolute Error (MAE)**  
**Formula:**  
\[
MAE = \frac{{\sum |y_{\text{actual}} - y_{\text{predicted}}|}}{n}
\]  
**Key Points:**  
- Robust to outliers.  
- Optimal predictions are near the median of target values.

---

### **b) Mean Squared Error (MSE)**  
**Formula:**  
\[
MSE = \frac{{\sum (y_{\text{actual}} - y_{\text{predicted}})^2}}{n}
\]  
**Key Points:**  
- Penalizes larger errors.  
- Sensitive to outliers.

---

### **c) Root Mean Squared Error (RMSE)**  
**Formula:**  
\[
RMSE = \sqrt{\frac{{\sum (y_{\text{actual}} - y_{\text{predicted}})^2}}{n}}
\]  
**Key Points:**  
- Expressed in the same units as the target, so it is easier to interpret than MSE.  
- Sensitive to outliers.
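
The three metrics can be compared on a tiny made-up example in which one prediction is far off, to see how squaring emphasizes the large error:

```python
import numpy as np

# Hypothetical actual and predicted values; the last prediction is an outlier
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_actual - y_predicted   # [1, 0, -1, -4]

mae = np.mean(np.abs(errors))     # (1 + 0 + 1 + 4) / 4
mse = np.mean(errors ** 2)        # (1 + 0 + 1 + 16) / 4
rmse = np.sqrt(mse)

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.4f}")
```

The single error of 4 contributes 16 of the 18 squared units in MSE, but only 4 of the 6 absolute units in MAE.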

---

## **7. Advantages and Disadvantages of Linear Regression**

### **Advantages**
1. Easy to implement and interpret.  
2. Works well when assumptions hold true.  
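3. Feature scaling does not change the predictions of a least-squares fit (only the coefficients are rescaled), although it helps gradient descent converge.  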
4. Coefficients highlight weak predictors, which helps when reducing the number of features.

### **Disadvantages**
1. Sensitive to outliers.  
2. Assumes linear relationships between variables.

---


# Project 1: Regression Based


# Iris Dataset Sepal Length Prediction

## Project Overview

This project involves predicting the **Sepal Length** of the Iris dataset based on the other features, including **Sepal Width**, **Petal Length**, **Petal Width**, and **Species**. We will implement a simple machine learning model, focusing on data exploration, preprocessing, visualization, and model evaluation.

---

## 1. Problem Statement

The task is to predict the **SepalLengthCm** using the other available features:
- **Independent Variables:** SepalWidthCm, PetalWidthCm, PetalLengthCm, Species
- **Dependent Variable:** SepalLengthCm

### Business Problem:
Predict `SepalLengthCm` based on various measurements in the Iris dataset to build a model that can provide insights into how these measurements impact sepal length.

---

## 2. Data Gathering

```python
# Import necessary libraries
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For visualization
import seaborn as sns  # For advanced visualization
```

We begin by importing the libraries necessary for the project. Pandas will be used for data manipulation, NumPy for numerical operations, and Matplotlib and Seaborn for visualization.

Next, we load the Iris dataset into a Pandas DataFrame.

```python
# Load the Iris dataset
df = pd.read_csv('Iris.csv')

# Display first 5 rows of the dataset
print(df.head())
```

Here, the first few rows of the dataset are displayed to get an overview of the data.

### Checking Basic Information

```python
# Check the shape (rows and columns)
row_count = df.shape[0]
col_count = df.shape[1]
print(f"Row count: {row_count}, Column count: {col_count}")

# Display column names
print(df.columns)

# Display dataset information (info() prints directly and returns None)
df.info()

# Check for null values
print(df.isnull().sum())
```

We check the number of rows and columns, the column names, and the overall structure of the dataset. We also check for any missing values.

---

## 3. Exploratory Data Analysis (EDA)

### 3.1 Analyze the Id Column

```python
# Display unique values in the 'Id' column
print(df['Id'].unique())

# Count unique values in 'Id'
print(df['Id'].nunique())

# Drop the 'Id' column
df.drop('Id', axis=1, inplace=True)
```

The Id column is only a row identifier and is unnecessary for modeling, so we examine its unique values and drop it from the dataset.

### 3.2 Analyze the SepalWidthCm Column

```python
# Count unique values in 'SepalWidthCm'
print(df['SepalWidthCm'].nunique())

# Visualize the distribution of SepalWidthCm
sns.histplot(df['SepalWidthCm'])
plt.show()

# Same histogram with a density curve (replaces the deprecated sns.distplot)
sns.histplot(df['SepalWidthCm'], kde=True)
plt.show()
```

We investigate the SepalWidthCm column, count its unique values, and visualize its distribution using histograms.

### 3.3 Correlation Heatmap

```python
# Calculate correlation matrix (numeric columns only; 'Species' is still text here)
correlation_matrix = df.corr(numeric_only=True)

# Visualize the correlation matrix using a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
```

We calculate the correlation between the numeric features and visualize the correlation matrix using a heatmap to understand the relationships between variables.

---

## 4. Data Preprocessing

### 4.1 Handle Missing Values

We ensure that there are no missing values in the dataset.

```python
# Check for missing values
print(df.isnull().sum())
```

If there were any missing values, we would handle them by either filling them or dropping the affected rows/columns. In this case, let's assume no missing values are found.

### 4.2 Encoding Categorical Variables

Since the Species column is categorical, we will encode it into numeric values.

```python
# Encode the 'Species' column
# The labels are assumed to be 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica',
# as in the earlier correlation cells; unmatched labels would become NaN
df['Species'] = df['Species'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})
```

The Species column is mapped to numerical values using the map() function.

---

## 5. Model Building

### 5.1 Train-Test Split

We split the data into training and testing sets to evaluate the model performance later.

```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% training, 20% testing)
X = df.drop('SepalLengthCm', axis=1)  # Features
y = df['SepalLengthCm']  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### 5.2 Model Selection

We'll use a linear regression model to predict SepalLengthCm.

```python
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
```

### 5.3 Model Evaluation

After training the model, we evaluate it using the testing set and calculate performance metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict the target variable on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
```

The metrics are printed to check how well the model performs.

---

## 6. Model Interpretation and Visualization

We can visualize the true vs. predicted values for a better understanding of the model’s performance.

```python
# Scatter plot to visualize true vs predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
# Diagonal reference line: points on it are predicted perfectly
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('True vs Predicted Sepal Length')
plt.show()
```

The scatter plot shows the relationship between the true and predicted values. The red dashed line represents perfect predictions, where the true values match the predicted ones.

---

## 7. Conclusion

- The linear regression model predicts the Sepal Length of the Iris dataset from the remaining features.
- The model evaluation metrics indicate the prediction quality.
- Future improvements could involve using more complex models like decision trees or neural networks for better accuracy.

### Future Work

- Experiment with other models like Random Forest or XGBoost.
- Explore hyperparameter tuning to optimize model performance.
- Extend the analysis to predict other features like PetalLengthCm or PetalWidthCm.

# Project 2: Regression Based


# Car Price Prediction

This project aims to predict the price of cars using various attributes like dimensions, engine characteristics, fuel efficiency, and other features. Below is the detailed explanation of the steps involved in this project.

---

## 1. Problem Statement
The objective is to build a regression model that predicts the price of cars based on the given dataset. The dataset contains various features of cars such as engine specifications, dimensions, and performance metrics.

---

## 2. Data Gathering
The data for this project is loaded from a CSV file named `autos_dataset.csv`.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('autos_dataset.csv')
df.head().T  # Transpose to view features clearly
```

---

## 3. Loading the Dataset

The steps from here on are worked through on the Iris dataset (`Iris.csv`), with `SepalLengthCm` as the target.

```python
# Load the Iris dataset into a pandas DataFrame
df = pd.read_csv('Iris.csv')

# Display the first few rows of the dataset to inspect it
print(df.head())  # This will show the first five rows of the dataset
```

The dataset is loaded into a pandas DataFrame. The first few rows are displayed to inspect the data and get an understanding of its structure.

---

## 4. Data Exploration

### 4.1 Dataset Shape

```python
# Checking the shape of the dataset (number of rows and columns)
row_count = df.shape[0]  # Number of rows
col_count = df.shape[1]  # Number of columns
print(f"Row count: {row_count}, Column count: {col_count}")  # Output the shape of the dataset
```

The shape of the dataset is checked to understand how many rows and columns are present.

### 4.2 Column Names

```python
# Display the column names to understand the structure
print(df.columns)
```

We display the column names to understand the structure of the dataset.

### 4.3 General Information

```python
# Check general information about the dataset, like data types and non-null counts
df.info()  # info() prints directly and returns None
```

We inspect the general information about the dataset, including data types and non-null counts.

### 4.4 Missing Values

```python
# Check for missing values in each column
print(df.isnull().sum())  # This will show the count of null values for each column
```

We check for missing values in the dataset, which might need handling during data preprocessing.

### 4.5 Unique Values in 'Id'

```python
# Analyze the 'Id' column, as it may not be useful for prediction
print(df['Id'].unique())  # List all unique values in the 'Id' column
print(df['Id'].nunique())  # Count the number of unique values in 'Id'
```

We analyze the 'Id' column to check if it is necessary for prediction. If it doesn't contribute to prediction, it will be dropped.

---

## 5. Data Cleaning

### 5.1 Drop 'Id' Column

```python
# Drop the 'Id' column as it doesn't contribute to predicting the Sepal Length
df.drop('Id', axis=1, inplace=True)  # axis=1 specifies that we are dropping a column, not a row
```

The 'Id' column is dropped as it does not provide useful information for prediction.

### 5.2 Visualizing 'SepalWidthCm'

```python
# Visualize the distribution of 'SepalWidthCm' using histograms
sns.histplot(df['SepalWidthCm'])  # Visualize the distribution using Seaborn
plt.show()  # Show the plot
```

We visualize the 'SepalWidthCm' feature using a histogram to understand its distribution.

---

## 6. Data Preprocessing

### 6.1 Encoding 'Species' Column

```python
# Encode the categorical 'Species' column into numerical format using pd.get_dummies
# This creates one 0/1 column per category; drop_first=True drops the first category
# to avoid a redundant (perfectly collinear) column
df = pd.get_dummies(df, columns=['Species'], drop_first=True, dtype=int)

# Display the updated dataframe to ensure that the 'Species' column has been encoded correctly
print(df.head())  # Display first few rows of the updated dataframe
```

The categorical 'Species' column is encoded into numerical format using one-hot encoding.

---

## 7. Model Training and Evaluation

### 7.1 Splitting the Dataset

```python
# Now, we split the dataset into features (X) and target variable (y)
X = df.drop('SepalLengthCm', axis=1)  # Features: Drop the target variable
y = df['SepalLengthCm']  # Target: SepalLengthCm column

# Split the dataset into training and testing sets (80% training, 20% testing)
from sklearn.model_selection import train_test_split  # Importing the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% training and 20% testing
```

We split the dataset into features (X) and the target variable (y) and further split it into training and testing sets.

### 7.2 Initializing and Training the Model

```python
# Initialize the Linear Regression model
from sklearn.linear_model import LinearRegression  # Importing the LinearRegression model
model = LinearRegression()  # Creating an instance of the linear regression model

# Train the model using the training data
model.fit(X_train, y_train)  # Fit the model to the training data
```

We initialize and train a Linear Regression model using the training data.

### 7.3 Making Predictions

```python
# Use the trained model to make predictions on the test set
y_pred = model.predict(X_test)  # Predict the Sepal Length on the test set
```

We use the trained model to predict the Sepal Length on the test set.

---

## 8. Model Evaluation

### 8.1 Performance Metrics

```python
# Evaluate the performance of the model using Mean Squared Error and R-squared score
from sklearn.metrics import mean_squared_error, r2_score  # Importing the metrics for evaluation
mse = mean_squared_error(y_test, y_pred)  # Calculate Mean Squared Error (MSE)
r2 = r2_score(y_test, y_pred)  # Calculate R-squared score

# Output the evaluation metrics to check the model's performance
print(f"Mean Squared Error: {mse}")  # Display MSE
print(f"R-squared Score: {r2}")  # Display R-squared score
```

We evaluate the performance of the model using Mean Squared Error (MSE) and R-squared score.

---

## 9. Model Coefficients

```python
# Finally, display the coefficients and intercept of the trained linear regression model
print("Coefficients:", model.coef_)  # Model's coefficients (weight for each feature)
print("Intercept:", model.intercept_)  # Model's intercept (constant term)
```

We display the model's coefficients and intercept to understand the influence of each feature on the prediction.

---

## Conclusion

The model has been trained and evaluated. The Mean Squared Error and R-squared Score provide insight into the model's accuracy. The coefficients tell us how much each feature influences the prediction of Sepal Length.