
# Machine Learning Fundamentals with Python

Welcome to this notebook on Machine Learning fundamentals! It covers the data analysis and prediction techniques
that form the foundations of Machine Learning: Mean, Median, Mode, Standard Deviation, Variance, Percentiles,
Data Distributions, Scatter Plots, and Regression Analysis. Each section is paired with Python code using libraries
such as NumPy, SciPy, Pandas, Matplotlib, and scikit-learn.

Machine learning is a branch of artificial intelligence that enables systems to learn patterns from data and improve their performance on a task over time without being explicitly programmed for it.

![image.png](attachment:image.png)

- AI - Siri
- ML - Spam filters that improve over time
- DL - Facial recognition

---

### Table of Contents:
1. [Mean, Median, and Mode](#1.-Mean,-Median,-and-Mode)
2. [Standard Deviation and Variance](#2.-Standard-Deviation-and-Variance)
3. [Percentiles](#3.-Percentiles)
4. [Data Distribution](#4.-Data-Distribution)
5. [Normal Data Distribution](#5.-Normal-Data-Distribution)
6. [Scatter Plot](#6.-Scatter-Plot)
7. [Linear Regression](#7.-Linear-Regression)
8. [Polynomial Regression](#8.-Polynomial-Regression)
9. [Multiple Regression](#9.-Multiple-Regression)

---


## 1. Mean, Median, and Mode

### Mean
The mean (average) value is the sum of all values divided by the number of values.

### Median
The median is the middle value of a sorted dataset.

### Mode
The mode is the value that appears most frequently.

Let's start with an example dataset representing car speeds and calculate the mean, median, and mode using NumPy and SciPy.


In [None]:

import numpy as np
from scipy import stats

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Mean
mean_speed = np.mean(speed)
print("Mean:", mean_speed)

# Median
median_speed = np.median(speed)
print("Median:", median_speed)

# Mode
mode_speed = stats.mode(speed)
print("Mode:", mode_speed)
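
To connect the three definitions to the library calls, here is a minimal sketch that computes the same statistics with plain Python. It uses only the standard library (`collections.Counter` is our choice here, not something the notebook relies on elsewhere), and it reports a single mode even if several values tie.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Mean: sum of all values divided by the number of values
mean_speed = sum(speed) / len(speed)

# Median: middle value of the sorted data
# (average of the two middle values when the count is even)
ordered = sorted(speed)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    median_speed = ordered[mid]
else:
    median_speed = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequent value, together with how often it occurs
mode_speed, mode_count = Counter(speed).most_common(1)[0]

print("Mean:", round(mean_speed, 2))
print("Median:", median_speed)
print("Mode:", mode_speed, "appears", mode_count, "times")
```

The results should agree with `np.mean`, `np.median`, and `stats.mode` for this dataset.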



## 2. Standard Deviation and Variance

Standard deviation tells us how spread out values are from the mean. Variance is the square of the standard deviation.


![image-2.png](attachment:image-2.png)

In [None]:

# Calculating Standard Deviation and Variance

# Example dataset
speed = [32, 111, 138, 28, 59, 77, 97]

# Variance
variance = np.var(speed)
print("Variance:", variance)

# Standard Deviation
std_dev = np.std(speed)
print("Standard Deviation:", std_dev)




- **Variance (1432.24)**: This is the average of the squared deviations from the mean. It tells us about the spread of the speeds, but since it’s in squared units, it can be harder to interpret directly.

- **Standard Deviation (37.85)**: Taking the square root of the variance gives the standard deviation, which is 37.85. This value is in the same units as the original data (speed), making it more interpretable. A standard deviation of 37.85 means that the speeds typically vary by about 37.85 units from the mean speed.
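
The definition above can be checked step by step. The sketch below recomputes the variance as the mean of squared deviations, and also shows the sample versions (dividing by `n - 1`), which NumPy exposes through the `ddof=1` argument. By default `np.var` and `np.std` use `ddof=0`, that is, the population formulas used in this section.

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]
values = np.array(speed, dtype=float)

# Population variance: mean of the squared deviations from the mean
deviations = values - values.mean()
manual_variance = np.mean(deviations ** 2)
manual_std = np.sqrt(manual_variance)

# Sample variance and standard deviation divide by (n - 1) instead of n
sample_variance = np.var(values, ddof=1)
sample_std = np.std(values, ddof=1)

print("Population variance:", round(manual_variance, 2))
print("Population standard deviation:", round(manual_std, 2))
print("Sample variance:", round(sample_variance, 2))
print("Sample standard deviation:", round(sample_std, 2))
```

Use the sample versions when the data is a sample drawn from a larger population rather than the whole population.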



## 3. Percentiles

Percentiles are used in statistics to describe the value below which a given percentage of observations fall.


In [None]:

# Calculating Percentiles

# Age data
ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

# 75th percentile
percentile_75 = np.percentile(ages, 75)
print("75th Percentile:", percentile_75)

# 90th percentile
percentile_90 = np.percentile(ages, 90)
print("90th Percentile:", percentile_90)


- **75th Percentile (43.0)**: This means that 75% of the ages in the list are at or below 43.0. In other words, 43 is a threshold age: three quarters of the people are this age or younger.

- **90th Percentile (61.0)**: This means that 90% of the ages in the list are at or below 61.0. So 61 is a high threshold: only about 10% of the ages (here, 80 and 82) lie above it.
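
To see where these numbers come from, here is a minimal sketch of the linear-interpolation rule that `np.percentile` uses by default. The helper name `percentile_linear` is ours, introduced only for illustration.

```python
import numpy as np

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

def percentile_linear(values, q):
    """Percentile via linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

for q in (75, 90):
    print(f"{q}th percentile:", percentile_linear(ages, q), "| NumPy:", np.percentile(ages, q))
```

With 21 values, the 75th percentile sits exactly at sorted index 15 and the 90th at index 18, so no interpolation is needed for this dataset.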



## 4. Data Distribution

In Machine Learning, we often need to analyze and visualize data distributions. Here, we'll generate random data 
and plot it using Matplotlib.


In [None]:

import matplotlib.pyplot as plt

# Generate random data
data = np.random.uniform(0.0, 5.0, 250)

# Plotting Histogram
plt.hist(data, bins=5)
plt.title("Data Distribution Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
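
The histogram is just a count of values per bin. As a rough sketch, the same counts can be obtained without plotting through `np.histogram`. The seeded generator is an assumption added here so the numbers repeat between runs; the cell above draws new values each time.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
data = rng.uniform(0.0, 5.0, 250)

# Count how many values fall into each of five equal-width bins
counts, edges = np.histogram(data, bins=5, range=(0.0, 5.0))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} values")
```

For uniformly distributed data, each of the five bins should hold roughly 50 of the 250 values.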



## 5. Normal Data Distribution

A normal distribution, also known as a Gaussian distribution or bell curve, is a bell-shaped distribution where most of the data points cluster around the mean and become rarer the further they lie from it.

It matters in statistics and data analysis because many common tools, such as z-scores and confidence intervals, assume that the data is approximately normal. When that assumption holds, those tools support better insights, predictions, and decisions.

In [None]:

# Generating normal distribution data
normal_data = np.random.normal(5.0, 1.0, 100000)

# Plotting the normal distribution
plt.hist(normal_data, bins=100)
plt.title("Normal Distribution Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
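
One way to make "cluster around the mean" concrete is the empirical rule: for normal data, roughly 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean. The sketch below checks this on simulated data; the seed is an arbitrary assumption for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatable output
normal_data = rng.normal(5.0, 1.0, 100000)

mean = normal_data.mean()
std = normal_data.std()

# Share of values within k standard deviations of the mean
shares = {}
for k in (1, 2, 3):
    shares[k] = np.mean(np.abs(normal_data - mean) <= k * std)
    print(f"Within {k} standard deviation(s): {shares[k]:.1%}")
```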



## 6. Scatter Plot

A scatter plot is a type of data visualization that shows the relationship between two variables on a two-dimensional graph. Each point represents one observation, with the x-axis showing one variable and the y-axis the other. Scatter plots make relationships, trends, clusters, and outliers in the data easy to spot.


In [None]:

# Scatter Plot example
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.title("Scatter Plot Example")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
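
A scatter plot shows a relationship visually; the Pearson correlation coefficient puts a number on how linear that relationship is. The following sketch uses `np.corrcoef`, which is not used elsewhere in this notebook, on the same data.

```python
import numpy as np

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print("Pearson correlation:", round(r, 2))
```

Values near -1 or 1 indicate a strong linear relationship, and values near 0 indicate little or none. Here the coefficient is negative, so `y` tends to fall as `x` rises.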



## 7. Linear Regression

Linear regression finds the best-fitting straight line (or "regression line") through the data points, enabling 
predictions of future values.

Linear regression is a statistical method used to model the relationship between two variables by fitting a linear equation (a straight line) to observed data. This technique is often used for predictive analysis, allowing us to predict the value of one variable based on the value of another.


In [None]:

from scipy import stats

# Define data
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Linear regression
slope, intercept, r, p, std_err = stats.linregress(x, y)

# Regression function
def linear_model(x):
    return slope * x + intercept

# Plot regression line
plt.scatter(x, y)
plt.plot(x, [linear_model(i) for i in x], color='red')
plt.title("Linear Regression")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
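
To show what "best-fitting" means, here is a minimal sketch that derives the slope and intercept from the least-squares formulas and then predicts a new value. The input `new_x = 10` is a hypothetical value picked for illustration; the results should match what `stats.linregress` returns.

```python
import numpy as np

x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6], dtype=float)
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86], dtype=float)

# Least-squares estimates for y = slope * x + intercept
x_dev = x - x.mean()
y_dev = y - y.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y.mean() - slope * x.mean()

# r squared: share of the variance in y explained by the line
r_squared = np.sum(x_dev * y_dev) ** 2 / (np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

# Predict y for a hypothetical new x value
new_x = 10
predicted = slope * new_x + intercept

print("Slope:", round(slope, 4))
print("Intercept:", round(intercept, 4))
print("r squared:", round(r_squared, 4))
print("Prediction for x = 10:", round(predicted, 2))
```

An r squared well below 1 is a reminder that predictions from this line carry noticeable uncertainty.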



## 8. Polynomial Regression

Polynomial regression fits a curve through data points to model non-linear relationships.

Polynomial regression is a type of regression analysis that models the relationship between the independent variable (X) and the dependent variable (Y) as an \( n \)-th degree polynomial. Unlike linear regression, which fits a straight line to the data, polynomial regression fits a curved line, which can capture more complex relationships.


In [None]:

# Polynomial Regression example
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

# Create model
poly_model = np.poly1d(np.polyfit(x, y, 3))
polyline = np.linspace(1, 22, 100)

# Plot
plt.scatter(x, y)
plt.plot(polyline, poly_model(polyline), color='red')
plt.title("Polynomial Regression")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
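
A plot shows that the curve follows the points, but not how well. As a sketch, the coefficient of determination (R squared) can be computed directly from the residuals, and the fitted polynomial can be evaluated at a new point. The value `new_x = 17` is a hypothetical input chosen for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22], dtype=float)
y = np.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100], dtype=float)

# Fit a degree-3 polynomial, as in the cell above
poly_model = np.poly1d(np.polyfit(x, y, 3))

# R squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - poly_model(x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Evaluate the fitted curve at a hypothetical new x value
new_x = 17
predicted = poly_model(new_x)

print("R squared:", round(r_squared, 2))
print("Prediction for x = 17:", round(predicted, 2))
```

An R squared close to 1 means the curve explains most of the variation in `y`. Raising the degree always increases R squared on the training data, so a high value alone does not prove the model will generalize.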



## 9. Multiple Regression

Multiple regression predicts a dependent variable based on multiple independent variables.


This example demonstrates how to use multiple linear regression to predict a dependent variable (CO2 emissions) based on multiple independent variables (the weight of a car and its engine volume).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from mpl_toolkits.mplot3d import Axes3D

# Sample data
data = {
    'Weight': [2300, 1500, 2000, 1000],
    'Volume': [1300, 1000, 1600, 1200],
    'CO2': [107, 99, 98, 95]
}
df = pd.DataFrame(data)

# Define independent and dependent variables
X = df[['Weight', 'Volume']]
y = df['CO2']

# Train model
model = LinearRegression()
model.fit(X, y)

# Generate predictions for plotting
# Create a grid of weight and volume values to predict CO2 emissions across a range
weight_range = np.linspace(900, 2500, 10)
volume_range = np.linspace(900, 1600, 10)
weight_grid, volume_grid = np.meshgrid(weight_range, volume_range)
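# Wrap the grid in a DataFrame with the training column names so that
# scikit-learn does not warn about missing feature names
X_grid = pd.DataFrame(
    np.c_[weight_grid.ravel(), volume_grid.ravel()],
    columns=['Weight', 'Volume']
)
y_grid = model.predict(X_grid).reshape(weight_grid.shape)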

# Visualize in 3D
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot for actual data
ax.scatter(df['Weight'], df['Volume'], df['CO2'], color='blue', label='Actual Data')

# Surface plot for predicted values
ax.plot_surface(weight_grid, volume_grid, y_grid, color='orange', alpha=0.5, rstride=100, cstride=100, edgecolor='w', linewidth=0.5)

# Set labels
ax.set_xlabel('Weight')
ax.set_ylabel('Volume')
ax.set_zlabel('CO2 Emissions')
ax.set_title('3D Visualization of CO2 Emission Prediction using Multiple Linear Regression')
ax.legend()

plt.show()


In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data
data = {
    'Weight': [2300, 1500, 2000, 1000],
    'Volume': [1300, 1000, 1600, 1200],
    'CO2': [107, 99, 98, 95]
}
df = pd.DataFrame(data)

# Define independent and dependent variables
X = df[['Weight', 'Volume']]
y = df['CO2']

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict CO2 for a new car
new_data = pd.DataFrame([[2500, 1300]], columns=['Weight', 'Volume'])
predicted_co2 = model.predict(new_data)
print("Predicted CO2:", predicted_co2[0])
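
To see what the fitted model contains, the sketch below solves the same least-squares problem with `np.linalg.lstsq` instead of scikit-learn. This is a cross-check under the assumption that `LinearRegression` performs ordinary least squares with an intercept (its default); the coefficients should then match `model.coef_` and `model.intercept_`.

```python
import numpy as np

weight = np.array([2300, 1500, 2000, 1000], dtype=float)
volume = np.array([1300, 1000, 1600, 1200], dtype=float)
co2 = np.array([107, 99, 98, 95], dtype=float)

# Design matrix: one column per predictor plus a column of ones for the intercept
A = np.column_stack([weight, volume, np.ones_like(weight)])

# Solve for CO2 = a * Weight + b * Volume + c in the least-squares sense
(a, b, c), *_ = np.linalg.lstsq(A, co2, rcond=None)

# Prediction for the same new car as above (Weight 2500, Volume 1300)
predicted_co2 = a * 2500 + b * 1300 + c

print("Weight coefficient:", a)
print("Volume coefficient:", b)
print("Intercept:", c)
print("Predicted CO2:", predicted_co2)
```

Each coefficient is the expected change in CO2 for a one-unit increase in that variable with the other held fixed. With only four data points, these estimates are illustrative rather than reliable.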


1. **Linear Regression Example: Predicting House Prices Based on Square Footage**

   **Scenario**: Suppose you're a real estate agent trying to predict the selling price of houses based on their square footage. You have data showing that as the square footage of a house increases, so does its price.

   **When Linear Regression Works**:
   - In many housing markets, the relationship between square footage and price tends to be fairly linear, especially for mid-sized homes in similar neighborhoods.
   - Here, you can model this relationship with a straight line using linear regression, where the **square footage** (independent variable) is a strong predictor of **price** (dependent variable).

   **Why Linear Regression Fits**:
   - The relationship is roughly proportional, meaning that every additional square foot adds a similar amount to the price, creating a pattern that a straight line can approximate.
   - Linear regression would give you a simple equation, such as:
     \[
     \text{Price} = m \times \text{Square Footage} + b
     \]
     where \( m \) represents the average increase in price per square foot.

2. **Polynomial Regression Example: Modeling Car Speed vs. Fuel Efficiency**

   **Scenario**: Imagine you're an automotive engineer analyzing how a car’s speed affects its fuel efficiency (measured in miles per gallon, or MPG). As the car accelerates, MPG initially increases to an optimal speed, then starts to decrease at higher speeds due to increased air resistance and engine strain.

   **When Polynomial Regression Works**:
   - The relationship between **speed** and **fuel efficiency** is non-linear. Fuel efficiency increases with speed up to a point but starts to decline once the car exceeds an optimal speed.
   - A polynomial regression (quadratic or cubic) can model this curve, capturing the increase, peak, and subsequent decrease in MPG as speed changes.

   **Why Polynomial Regression Fits**:
   - A quadratic regression model (degree 2) can represent the inverted-U-shaped curve, where MPG rises, peaks, and then falls as speed increases (so the leading coefficient \( a \) is negative):
     \[
     \text{MPG} = a \times (\text{Speed})^2 + b \times \text{Speed} + c
     \]
     where \( a \), \( b \), and \( c \) are coefficients that shape the curve.

3. **Multiple Linear Regression Example: Predicting CO2 Emissions Based on Car Weight and Volume**

   **Scenario**: Imagine you’re an environmental analyst interested in predicting CO2 emissions for cars based on two variables: car weight and engine volume. You have data showing that both weight and volume contribute to CO2 emissions.

   **When Multiple Linear Regression Works**:
   - When you have multiple independent variables (predictors) that influence the dependent variable, multiple linear regression can capture the combined effect of each variable.
   - In this case, both **weight** and **volume** are strong predictors of **CO2 emissions**.

   **Why Multiple Linear Regression Fits**:
   - Multiple linear regression can model the relationship using an equation that includes both variables:
     \[
     \text{CO2} = a \times \text{Weight} + b \times \text{Volume} + c
     \]
     where \( a \) and \( b \) represent the effects of weight and volume on CO2 emissions, respectively, and \( c \) is the intercept.
   - This approach allows us to see how each independent variable (weight and volume) contributes to changes in CO2 emissions.

### Summary Comparison

- **Linear Regression** is best for relationships where one variable changes consistently with the other (like square footage and house price).
- **Polynomial Regression** is ideal when the relationship is more complex and curved, as in the case of fuel efficiency and speed, where the data can’t be well-approximated by a straight line.
- **Multiple Linear Regression** is useful when more than one variable affects the outcome, as in the CO2 emissions example, where both weight and volume impact emissions.

In all examples, selecting the right model helps improve the accuracy of predictions and provides more useful insights into the nature of each relationship.
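
As a rough illustration of model selection, the sketch below fits both a straight line and a quadratic to speed and MPG values and compares their R squared. The numbers are invented for this example and are not measurements; the helper `r_squared` is likewise ours.

```python
import numpy as np

# Made-up speed (mph) and fuel efficiency (MPG) values, for illustration only
speed = np.array([20, 30, 40, 50, 60, 70, 80], dtype=float)
mpg = np.array([22, 28, 32, 33, 31, 27, 21], dtype=float)

def r_squared(degree):
    """R squared of a polynomial fit of the given degree to the data above."""
    model = np.poly1d(np.polyfit(speed, mpg, degree))
    ss_res = np.sum((mpg - model(speed)) ** 2)
    ss_tot = np.sum((mpg - mpg.mean()) ** 2)
    return 1 - ss_res / ss_tot

linear_r2 = r_squared(1)
quadratic_r2 = r_squared(2)

# Leading coefficient of the quadratic: negative for an inverted-U shape
a = np.polyfit(speed, mpg, 2)[0]

print("Linear R squared:", round(linear_r2, 3))
print("Quadratic R squared:", round(quadratic_r2, 3))
print("Quadratic leading coefficient:", a)
```

On data that rises and then falls, the straight line explains very little of the variation, while the quadratic captures most of it.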