<a href="https://colab.research.google.com/github/alfonsoayalapaloma/ml-2024/blob/main/ds_eda_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">


## <center> Regression </center>

## TIPS Dataset

The `tips` dataset is a popular dataset included in the seaborn library, often used for demonstrating data visualization and statistical analysis techniques. It contains information about tips received by waitstaff in a restaurant, along with various attributes related to the dining experience.

### Overview
The dataset consists of 244 observations and 7 variables. Here are the details of each variable:

1. **total_bill**: The total bill (cost of the meal) in dollars.
2. **tip**: The tip amount in dollars.
3. **sex**: The gender of the person paying the bill (either "Male" or "Female").
4. **smoker**: Whether the person was a smoker or not (either "Yes" or "No").
5. **day**: The day of the week when the meal was served (one of "Thur", "Fri", "Sat", or "Sun").
6. **time**: The time of day when the meal was served (either "Lunch" or "Dinner").
7. **size**: The size of the dining party.

### Example
Here's a small sample of what the dataset looks like:

| total_bill | tip  | sex   | smoker | day  | time   | size |
|------------|------|-------|--------|------|--------|------|
| 16.99      | 1.01 | Female| No     | Sun  | Dinner | 2    |
| 10.34      | 1.66 | Male  | No     | Sun  | Dinner | 3    |
| 21.01      | 3.50 | Male  | No     | Sun  | Dinner | 3    |
| 23.68      | 3.31 | Male  | No     | Sun  | Dinner | 2    |
| 24.59      | 3.61 | Female| No     | Sun  | Dinner | 4    |
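
As a quick illustration of working with these columns, the sketch below rebuilds the five sample rows by hand and derives a tip percentage. The `tip_pct` column is a hypothetical addition for illustration and is not part of the dataset.

```python
import pandas as pd

# The five sample rows from the table above, typed in by hand
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# tip_pct is a hypothetical derived column: the tip as a percentage of the bill
sample["tip_pct"] = 100 * sample["tip"] / sample["total_bill"]

print(sample[["total_bill", "tip", "tip_pct"]].round(2))
```

Even in these five rows the tip percentage varies a lot, which is one reason to model the tip as a function of the bill rather than assume a fixed rate.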

### Applications
The `tips` dataset is commonly used for:
- **Data Visualization**: Creating various plots to explore relationships between variables (e.g., scatter plots, bar plots, box plots).
- **Statistical Analysis**: Performing statistical tests to understand the significance of relationships between variables.
- **Machine Learning**: Building and evaluating predictive models (e.g., predicting tip amount based on other features).

### Example Analysis
For instance, you might use the `tips` dataset to explore questions like:
- How does the tip amount vary with the total bill?
- Are tips higher during dinner compared to lunch?
- Do smoking and non-smoking customers tip differently?
- Is there a difference in tipping patterns across other groups, such as day of the week or sex?

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the tips dataset from seaborn
tips = sns.load_dataset('tips')

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(tips.head())

cols=["total_bill","tip","size"]

In [None]:
# Display basic statistics of the dataset
print("\nBasic statistics of the dataset:")
print(tips.describe())


In [None]:
# Display information about the dataset
print("\nInformation about the dataset:")
tips.info()  # info() prints its summary directly and returns None

In [None]:
# Check for missing values
print("\nMissing values in the dataset:")
print(tips.isnull().sum())


In [None]:
# Distribution of total_bill
plt.figure(figsize=(10, 6))
sns.histplot(tips['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()


In [None]:
# Distribution of tip
plt.figure(figsize=(10, 6))
sns.histplot(tips['tip'], kde=True)
plt.title('Distribution of Tip')
plt.xlabel('Tip')
plt.ylabel('Frequency')
plt.show()


In [None]:
sns.set_theme(style="ticks")
cols_to_plot=["total_bill","tip","size"]
sns.pairplot(tips[cols_to_plot], kind='reg', diag_kind='kde',
             plot_kws={'line_kws':{'color':'red'}})

In [None]:
# Box plot of total_bill by day
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()


In [None]:
# Box plot of tip by day
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='tip', data=tips)
plt.title('Box Plot of Tip by Day')
plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()


In [None]:
# Scatter plot of total_bill vs. tip
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_bill', y='tip', data=tips, label='Data Points')
plt.title('Scatter Plot of Total Bill vs. Tip')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend()
plt.show()


In [None]:
def least_squares_slope(points):
    """Return the least-squares (slope, intercept) for a list of (x, y) points."""
    n = len(points)
    sum_x = sum(point[0] for point in points)
    sum_y = sum(point[1] for point in points)
    sum_x_squared = sum(point[0]**2 for point in points)
    sum_xy = sum(point[0] * point[1] for point in points)

    # Calculate the slope (m)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x**2)

    # Calculate the intercept (b)
    intercept = (sum_y - slope * sum_x) / n

    return slope, intercept


In [None]:
# Convert to a list of (x, y) points where x = total_bill and y = tip
points = list(zip(tips['total_bill'], tips['tip']))

slope, intercept = least_squares_slope(points)
print("Slope:", slope)
print("Intercept:", intercept)

In [None]:
import numpy as np

# Create a line based on the slope and intercept
x_vals = np.array(tips['total_bill'])
y_vals = slope * x_vals + intercept

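# Plot the observed data, then overlay the regression line
plt.figure(figsize=(10, 6))
plt.scatter(x_vals, tips['tip'], label='Data Points')
plt.plot(x_vals, y_vals, color='red', label='Best fit line')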

# Add labels and legend
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.title("Total Bill vs Tip with Least Squares Regression Line")
plt.legend()

plt.show()

In [None]:
# Box plot of total_bill by time (Lunch/Dinner)
plt.figure(figsize=(10, 6))
sns.boxplot(x='time', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Time (Lunch/Dinner)')
plt.xlabel('Time')
plt.ylabel('Total Bill')
plt.show()


In [None]:
# Box plot of tip by time (Lunch/Dinner)
plt.figure(figsize=(10, 6))
sns.boxplot(x='time', y='tip', data=tips)
plt.title('Box Plot of Tip by Time (Lunch/Dinner)')
plt.xlabel('Time')
plt.ylabel('Tip')
plt.show()


In [None]:
# Box plot of total_bill by sex
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Sex')
plt.xlabel('Sex')
plt.ylabel('Total Bill')
plt.show()


In [None]:
# Box plot of tip by sex
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='tip', data=tips)
plt.title('Box Plot of Tip by Sex')
plt.xlabel('Sex')
plt.ylabel('Tip')
plt.show()


In [None]:
# Box plot of total_bill by smoker status
plt.figure(figsize=(10, 6))
sns.boxplot(x='smoker', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Smoker Status')
plt.xlabel('Smoker Status')
plt.ylabel('Total Bill')
plt.show()


In [None]:
# Box plot of tip by smoker status
plt.figure(figsize=(10, 6))
sns.boxplot(x='smoker', y='tip', data=tips)
plt.title('Box Plot of Tip by Smoker Status')
plt.xlabel('Smoker Status')
plt.ylabel('Tip')
plt.show()


## Linear Regression
Linear regression is a fundamental statistical and machine learning technique used to model the relationship between a dependent variable (target) and one or more independent variables (features). The goal is to find the best-fitting line (or hyperplane in higher dimensions) that predicts the target variable based on the features.

### Key Concepts

1. **Simple Linear Regression**: Involves one independent variable and one dependent variable. The relationship is modeled as a straight line:
   $$ y = \beta_0 + \beta_1 x + \epsilon $$
   - \( y \): Dependent variable (target)
   - \( x \): Independent variable (feature)
   - \( \beta_0 \): Intercept (the value of \( y \) when \( x = 0 \))
   - \( \beta_1 \): Slope (the change in \( y \) for a one-unit change in \( x \))
   - \( \epsilon \): Error term (captures the deviation of the observed values from the predicted values)

2. **Multiple Linear Regression**: Involves multiple independent variables. The relationship is modeled as:
   $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon $$
   - \( x_1, x_2, \ldots, x_n \): Independent variables (features)
   - \( \beta_1, \beta_2, \ldots, \beta_n \): Coefficients for each feature
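
The multiple regression equation above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the tips dataset: the "true" coefficients (1.0, 0.1, 0.2), the noise level, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 1.0 + 0.1*x1 + 0.2*x2 + noise
n = 200
x1 = rng.uniform(5, 50, size=n)
x2 = rng.integers(1, 7, size=n).astype(float)
eps = rng.normal(0, 0.1, size=n)
y = 1.0 + 0.1 * x1 + 0.2 * x2 + eps

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimate of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Estimated coefficients:", beta_hat)
```

The column of ones is what lets the same matrix formulation estimate the intercept alongside the feature coefficients.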

### Assumptions
Linear regression makes several key assumptions:
- **Linearity**: The relationship between the independent and dependent variables is linear.
- **Independence**: Observations are independent of each other.
- **Homoscedasticity**: The residuals (errors) have constant variance at every level of the independent variable.
- **Normality**: The residuals are normally distributed.
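
Some of these assumptions can be checked roughly by inspecting the residuals of a fitted model. The sketch below uses synthetic data that satisfies the assumptions by construction. The spread ratio and skewness are informal diagnostics chosen for illustration, not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): linear trend plus constant-variance Gaussian noise
x = rng.uniform(0, 50, size=300)
y = 1.0 + 0.1 * x + rng.normal(0, 1.0, size=300)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# With an intercept in the model, least-squares residuals average to zero
mean_residual = residuals.mean()

# Rough homoscedasticity check: compare residual spread for low and high x
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
spread_ratio = high.std() / low.std()

# Rough normality check: sample skewness should be near 0 for symmetric residuals
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)

print("Mean residual:", mean_residual)
print("Spread ratio (high x / low x):", spread_ratio)
print("Residual skewness:", skewness)
```

A spread ratio far from 1 would suggest non-constant variance. In practice, a plot of residuals against fitted values is usually more informative than any single number.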

### Model Training
Training a linear regression model involves finding the best-fitting line by minimizing the sum of the squared differences between the observed values and the predicted values (least squares method).
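
For simple linear regression, the least-squares solution has a closed form: the slope is the ratio of the summed cross-deviations to the summed squared deviations of x, and the intercept follows from the means. The sketch below applies it to a small made-up data set (the values are assumptions for illustration) and compares the result with `np.polyfit`. It is algebraically equivalent to the summation formula used in `least_squares_slope` earlier in this notebook.

```python
import numpy as np

# Small made-up data set (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Cross-check against NumPy's polynomial fit of degree 1
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print("Closed form:", beta_1, beta_0)
print("np.polyfit: ", slope_np, intercept_np)
```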

### Evaluation Metrics
Common metrics to evaluate the performance of a linear regression model include:
- **Mean Squared Error (MSE)**: The average of the squared differences between the observed and predicted values.
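- **R-squared (R²)**: The proportion of the variance in the dependent variable that is predictable from the independent variables. For a least-squares fit with an intercept, it lies between 0 and 1 on the training data, with higher values indicating a better fit; on held-out data it can be negative when the model predicts worse than the mean of the observed values.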
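
Both metrics are short enough to compute by hand. In the sketch below, `mse` and `r_squared` are hypothetical helper names, and the arrays are made-up values chosen to show three cases: a good prediction, a mean-only baseline, and a prediction worse than the baseline.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """1 minus the ratio of the residual to the total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up observed values and three sets of predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])    # close to the observations
mean_only = np.full(4, y_true.mean())    # always predicts the mean
bad = np.array([4.0, 3.0, 2.0, 1.0])     # worse than predicting the mean

for name, pred in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, "MSE:", mse(y_true, pred), "R^2:", r_squared(y_true, pred))
```

Always predicting the mean gives an R² of exactly 0, and any prediction with a larger squared error than that baseline gives a negative R².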


In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# Load the tips dataset from seaborn
tips = sns.load_dataset('tips')

# Display the first few rows of the dataset
print(tips.head())

# Define the feature (X) and target (y)
X = tips[['total_bill']]
y = tips['tip']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate mean squared error and R^2 score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")

# Display the coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of total_bill vs. tip
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_bill', y='tip', data=tips, label='Data Points')

# Fit a linear regression model and plot the line
sns.regplot(x='total_bill', y='tip', data=tips, scatter=False, label='Linear Fit', color='red')

# Add labels and title
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Scatter Plot and Linear Fit of Total Bill vs. Tip')
plt.legend()

# Show the plot
plt.show()

## Exercise
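Analyze the `mpg` dataset, which is also included in seaborn. As a starting point, the cell below fits a linear regression of `mpg` on `horsepower`.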


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the mpg dataset
mpg = sns.load_dataset("mpg").dropna()  # Drop rows with missing values

# Scatter plot with linear regression line
plt.figure(figsize=(8, 6))
sns.regplot(x="horsepower", y="mpg", data=mpg, color="blue", line_kws={"color": "red"})

# Add labels and title
plt.xlabel("Horsepower")
plt.ylabel("Miles per Gallon (mpg)")
plt.title("Linear Regression of MPG vs Horsepower")

plt.show()
