Okay, I will break down the "Python Cheat Sheet for Data Analysis" PDF, explain each section with Python code examples runnable in Google Colab, and then provide 50 practice questions.

---
### Python Cheat Sheet for Data Analysis: Explained

This cheat sheet covers the essential steps and Python commands for a typical data analysis workflow, from loading data to model development and evaluation.

---
### 1. Data Loading
This section focuses on importing data into a Pandas DataFrame, which is the primary data structure used for data analysis in Python.


**Markdown Cell:**
```markdown
#### Reading a CSV dataset
Pandas provides the `read_csv()` function to load data from a Comma-Separated Values (CSV) file into a DataFrame.
- You can specify if the file has a header or assign column names.
```

**Code Cell 1: Reading CSV and Basic Inspection**

In [None]:
import pandas as pd
import numpy as np # numpy is often needed for numerical operations and handling NaN

# Create a dummy CSV file for demonstration
csv_data_with_header = """col1,col2,col3
1,a,?
2,b,5
3,?,7
4,d,9"""
with open('sample_with_header.csv', 'w') as f:
    f.write(csv_data_with_header)

csv_data_no_header = """10,x,11
20,y,15
30,z,17
40,w,19"""
with open('sample_no_header.csv', 'w') as f:
    f.write(csv_data_no_header)

# Load CSV using the first row as header (default behavior if header=0 or not specified)
df_with_header = pd.read_csv('sample_with_header.csv')
print("DataFrame with header:")
print(df_with_header)

# Load CSV without a header, pandas will assign default integer headers (0, 1, 2...)
df_no_header_auto = pd.read_csv('sample_no_header.csv', header=None)
print("\nDataFrame without header (auto-assigned column names):")
print(df_no_header_auto)

# Assign custom header names if the CSV doesn't have them
headers = ["ID", "Category", "Value"]
df_custom_headers = pd.read_csv('sample_no_header.csv', names=headers) # 'names' is preferred over 'header=None' then df.columns
print("\nDataFrame with custom assigned headers:")
print(df_custom_headers)


**Markdown Cell:**
```markdown
#### Inspecting the DataFrame
Once loaded, you can inspect the DataFrame:
- `df.head(n)`: Shows the first `n` rows (default is 5).
- `df.tail(n)`: Shows the last `n` rows (default is 5).
- `df.columns`: Shows the column names. Can also be used to assign new column names.
- `df.dtypes`: Shows the data type of each column.
- `df.describe()`: Provides descriptive statistics (count, mean, std, min, max, quartiles) for numerical columns. `include="all"` shows stats for object/categorical columns too.
- `df.info()`: Gives a concise summary of the DataFrame, including data types, non-null values, and memory usage.
```

**Code Cell 2: Inspecting DataFrame**

In [None]:
# Using df_with_header from the previous cell for inspection
print("\nFirst 3 rows of df_with_header:")
print(df_with_header.head(3))

print("\nLast 2 rows of df_with_header:")
print(df_with_header.tail(2))

print("\nColumn names of df_with_header:")
print(df_with_header.columns)

# Example of assigning new header names (though df_with_header already has them)
# df_with_header.columns = ['NewUrl1', 'NewUrl2', 'NewUrl3']
# print("\nAfter renaming columns:")
# print(df_with_header.columns)

print("\nData types of df_with_header:")
print(df_with_header.dtypes)

print("\nStatistical description of df_with_header (numerical columns by default):")
# Convert 'col3' to numeric, as '?' might make it object type.
# First, replace '?' with NaN
df_with_header_cleaned = df_with_header.replace("?", np.nan)
df_with_header_cleaned['col3'] = pd.to_numeric(df_with_header_cleaned['col3'])
print(df_with_header_cleaned.describe())

print("\nStatistical description including all attributes (object types too):")
print(df_with_header_cleaned.describe(include="all"))

print("\nSummary info of df_with_header_cleaned:")
df_with_header_cleaned.info()


**Markdown Cell:**
```markdown
#### Handling Specific Values and Saving
- `df.replace("?", np.nan)`: Replaces specific placeholders (like "?") with NumPy's `NaN` (Not a Number), which is Pandas' standard way of representing missing values.
- `df.to_csv(<output_path>)`: Saves the DataFrame to a CSV file. `index=False` is often used to avoid writing the DataFrame index as a column in the CSV.
```

**Code Cell 3: Replacing Values and Saving**

In [None]:
# df_with_header_cleaned already has '?' replaced with NaN from the previous cell.
print("\nDataFrame after replacing '?' with NaN (df_with_header_cleaned):")
print(df_with_header_cleaned)

# Save the cleaned DataFrame to a new CSV file
output_csv_path = 'cleaned_data.csv'
df_with_header_cleaned.to_csv(output_csv_path, index=False) # index=False avoids writing df index as a column
print(f"\nCleaned DataFrame saved to {output_csv_path}")

# You can verify by reading it back
df_reloaded = pd.read_csv(output_csv_path)
print("\nReloaded DataFrame:")
print(df_reloaded)
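
The placeholder can also be handled at load time rather than afterwards. The sketch below assumes the same `?` marker as the sample file and passes it to `read_csv` through `na_values`; the inline CSV text and the name `df_na` are made up for illustration.

```python
import io

import pandas as pd

# Same layout as the sample file above, held in memory so the example is self-contained
csv_text = "col1,col2,col3\n1,a,?\n2,b,5\n3,?,7\n4,d,9\n"

# na_values tells read_csv which strings to treat as missing while parsing
df_na = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(df_na.dtypes)        # col3 is parsed as a numeric column
print(df_na.isna().sum())  # one missing value each in col2 and col3
```

Because the marker never reaches the DataFrame as text, no separate `pd.to_numeric` step is needed for `col3` here.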

---
### 2. Data Wrangling
This involves preprocessing the data to handle issues like missing values, incorrect data types, and transforming data into a more suitable format for analysis or modeling.

**Markdown Cell:**
```markdown
#### Handling Missing Data
Missing data can be handled in several ways:
- **Replace with most frequent value:** Useful for categorical columns.
- **Replace with mean/median:** Useful for numerical columns. Median is often preferred if the data has outliers.
Assigning the result back to the column (`df['col'] = df['col'].replace(...)`) is the most portable pattern. Calling `replace(..., inplace=True)` on a selected column is chained assignment: recent pandas 2.x releases warn about it, and with Copy-on-Write enabled it leaves the DataFrame unchanged.
```

**Code Cell 4: Handling Missing Data**

In [None]:
# Create a sample DataFrame with missing values
data_missing = {'colA': ['X', 'Y', 'X', 'Z', np.nan, 'Y', 'X'],
                'colB': [10, 20, np.nan, 30, 40, 20, np.nan],
                'colC': [100, 110, 120, 100, np.nan, 110, 100]}
df_missing = pd.DataFrame(data_missing)
print("Original DataFrame with missing values:")
print(df_missing)

# Replace missing data in 'colA' (categorical) with the most frequent entry
most_frequent_colA = df_missing['colA'].value_counts().idxmax()
df_missing['colA'] = df_missing['colA'].replace(np.nan, most_frequent_colA) # Assign back to update df_missing
print("\nDataFrame after replacing NaN in 'colA' with most frequent:")
print(df_missing)

# Replace missing data in 'colB' (numerical) with the mean
# Ensure colB is numeric before calculating mean (it should be if NaNs are np.nan)
average_value_colB = df_missing['colB'].astype(float).mean(axis=0) # axis=0 for column mean
df_missing['colB'] = df_missing['colB'].replace(np.nan, average_value_colB)
print("\nDataFrame after replacing NaN in 'colB' with mean:")
print(df_missing)

# Alternative using fillna(), which is often preferred
# df_missing['colC'] = df_missing['colC'].fillna(df_missing['colC'].median()) # Example with median for colC
# print("\nDataFrame after replacing NaN in 'colC' with median using fillna():")
# print(df_missing)
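
When several columns need different fill strategies, `fillna` also accepts a dictionary mapping column names to fill values. A minimal sketch on a made-up DataFrame (`df_fill` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

df_fill = pd.DataFrame({
    "colA": ["X", "Y", "X", np.nan],     # categorical column
    "colB": [10.0, np.nan, 30.0, 20.0],  # numerical column
})

# One fill value per column: the mode for the categorical, the median for the numerical
fill_values = {
    "colA": df_fill["colA"].mode()[0],
    "colB": df_fill["colB"].median(),
}
df_filled = df_fill.fillna(fill_values)  # returns a new DataFrame
print(df_filled)
```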

**Markdown Cell:**
```markdown
#### Fixing Data Types
Ensure columns have the correct data types for analysis (e.g., numbers stored as strings should be converted to numeric types).
```

**Code Cell 5: Fixing Data Types**

In [None]:
# Example: df_with_header_cleaned['col3'] was already converted in a previous cell.
# Let's assume we have another column that should be an integer but is an object.
df_types = pd.DataFrame({'ID': [1, 2, 3], 'ValueStr': ['100', '250', '80']})
print("\nOriginal DataFrame with 'ValueStr' as object:")
print(df_types)
print(df_types.dtypes)

df_types['ValueStr'] = df_types['ValueStr'].astype(int)
print("\nDataFrame after converting 'ValueStr' to int:")
print(df_types)
print(df_types.dtypes)

# Multiple columns at once
# df_types[['col_num1', 'col_num2']] = df_types[['col_num1', 'col_num2']].astype(float)

**Markdown Cell:**
```markdown
#### Data Normalization (Simple Max Scaling)
Normalization scales numerical data to a common range, like [0, 1]. One simple method is dividing each value by the column's maximum, which yields values in [0, 1] when the column is non-negative.
```

**Code Cell 6: Data Normalization (Max Scaling)**

In [None]:
df_normalize = pd.DataFrame({'Score': [10, 20, 5, 15, 20]})
print("\nOriginal DataFrame for normalization:")
print(df_normalize)

df_normalize['Score_Normalized'] = df_normalize['Score'] / df_normalize['Score'].max()
print("\nDataFrame after max normalization:")
print(df_normalize)
# Note: Other common normalization methods include Min-Max scaling and Z-score standardization.
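
The two other methods named in the note above can be written directly with pandas arithmetic. This is a sketch on the same `Score` values; the variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([10, 20, 5, 15, 20], name="Score")

# Min-Max scaling: maps the smallest value to 0 and the largest to 1
min_max = (scores - scores.min()) / (scores.max() - scores.min())

# Z-score standardization: zero mean, unit standard deviation
# (pandas .std() uses the sample standard deviation, ddof=1)
z_score = (scores - scores.mean()) / scores.std()

print(pd.DataFrame({"Score": scores, "MinMax": min_max, "ZScore": z_score}))
```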

**Markdown Cell:**
```markdown
#### Binning
Binning groups continuous numerical data into discrete "bins" or categories.
```

**Code Cell 7: Binning**

In [None]:
df_binning = pd.DataFrame({'Age': [22, 25, 31, 45, 52, 23, 38, 60, 29]})
print("\nOriginal DataFrame for binning:")
print(df_binning)

# Define bin edges
# np.linspace creates evenly spaced numbers over a specified interval.
# Here, 3 bins mean 4 edges.
bins = np.linspace(min(df_binning['Age']), max(df_binning['Age']), 4) # 4 edges for 3 bins
print("\nBin edges:", bins)

group_names = ['Young', 'Middle-aged', 'Senior']

df_binning['Age_Group'] = pd.cut(df_binning['Age'], bins, labels=group_names, include_lowest=True)
print("\nDataFrame after binning 'Age':")
print(df_binning)
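
Equal-width edges from `np.linspace` depend on the minimum and maximum of the data. When the categories have a fixed meaning, explicit edges can be passed to `pd.cut` instead. The cut points 30 and 50 below are arbitrary assumptions, not values from the cheat sheet.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 45, 52, 23, 38, 60, 29])

# Intervals are right-inclusive by default: (0, 30], (30, 50], (50, 120]
age_group = pd.cut(
    ages,
    bins=[0, 30, 50, 120],
    labels=["Young", "Middle-aged", "Senior"],
)

# Count how many rows fall into each bin
print(age_group.value_counts(sort=False))
```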

**Markdown Cell:**
```markdown
#### Changing Column Names
Use `df.rename()` to change column names.
```

**Code Cell 8: Changing Column Names**

In [None]:
df_rename = pd.DataFrame({'old_col_1': [1,2], 'old_col_2': [3,4]})
print("\nOriginal DataFrame for renaming:")
print(df_rename)

df_rename.rename(columns={'old_col_1': 'new_column_A', 'old_col_2': 'new_column_B'}, inplace=True)
print("\nDataFrame after renaming columns:")
print(df_rename)

**Markdown Cell:**
```markdown
#### Indicator Variables (One-Hot Encoding)
Converts categorical variables into a set of binary (0 or 1) columns, one for each category. This is essential for many machine learning algorithms.
```

**Code Cell 9: Indicator Variables (One-Hot Encoding)**

In [None]:
df_indicator = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
print("\nOriginal DataFrame for indicator variables:")
print(df_indicator)

# Create dummy variables for the 'Color' column
dummy_variables_color = pd.get_dummies(df_indicator['Color'], prefix='Color', dtype=int) # prefix is optional but good practice; dtype=int gives 0/1 columns (pandas 2.x defaults to True/False)
print("\nDummy variables created:")
print(dummy_variables_color)

# Concatenate the new dummy variable columns to the original DataFrame
df_indicator = pd.concat([df_indicator, dummy_variables_color], axis=1)
# Optionally, drop the original categorical column
# df_indicator.drop('Color', axis=1, inplace=True)
print("\nDataFrame after adding dummy variables:")
print(df_indicator)
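
With a full set of indicator columns, any one column can be inferred from the others, which can cause problems for linear models. A common variation, sketched here on the same colors, encodes the column in place and drops the first category with `drop_first=True`.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Blue"]})

# columns=[...] replaces 'Color' with its indicator columns in one step;
# drop_first=True omits the first category (alphabetically 'Blue' here)
encoded = pd.get_dummies(colors, columns=["Color"], drop_first=True, dtype=int)

print(encoded)
```

A row with zeros in every remaining column then stands for the dropped category.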

---
### 3. Exploratory Data Analysis (EDA)
This stage involves examining the data to find patterns, anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

**Markdown Cell:**
```markdown
#### Correlation
Correlation measures the statistical relationship between two variables.
- `df.corr()`: Computes pairwise correlation of columns. In pandas 2.x, pass `numeric_only=True` when the DataFrame also contains non-numeric columns.
- `df[['col1', 'col2']].corr()`: Computes correlation between specified columns.
Correlation coefficients range from -1 to +1.
```

**Code Cell 10: Correlation**

In [None]:
# Create a sample DataFrame for EDA
data_eda = {
    'EngineSize': [100, 120, 110, 150, 90, 130],
    'Horsepower': [70, 90, 80, 110, 60, 100],
    'Price': [10000, 15000, 12000, 20000, 9000, 17000],
    'Category': ['A', 'B', 'A', 'C', 'A', 'B']
}
df_eda = pd.DataFrame(data_eda)
print("Sample DataFrame for EDA:")
print(df_eda)

print("\nComplete DataFrame correlation (numerical columns):")
print(df_eda.corr(numeric_only=True)) # numeric_only=True skips the non-numeric 'Category' column, which would otherwise raise an error in pandas 2.x

print("\nCorrelation between 'EngineSize' and 'Price':")
print(df_eda[['EngineSize', 'Price']].corr())

**Markdown Cell:**
```markdown
#### Visualization
Visual plots are key to understanding data.
- **Scatter Plot (`plt.scatter`)**: Shows the relationship between two numerical variables.
- **Regression Plot (`sns.regplot`)**: A scatter plot with a linear regression line fitted to the data. Useful for visualizing linear relationships.
- **Box Plot (`sns.boxplot`)**: Displays the distribution of a numerical variable, showing median, quartiles, and potential outliers. Often used to compare distributions across categories.
Matplotlib (`plt`) and Seaborn (`sns`) are common plotting libraries.
```

**Code Cell 11: Scatter Plot, Regression Plot, Box Plot**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot: Horsepower vs Price
plt.figure(figsize=(6,4)) # Create a new figure
plt.scatter(df_eda['Horsepower'], df_eda['Price'])
plt.xlabel("Horsepower")
plt.ylabel("Price")
plt.title("Scatter Plot: Horsepower vs. Price")
plt.show() # Display the plot

# Regression plot: EngineSize vs Price
plt.figure(figsize=(6,4))
sns.regplot(x='EngineSize', y='Price', data=df_eda)
plt.title("Regression Plot: EngineSize vs. Price")
plt.show()

# Box plot: Price distribution by Category
plt.figure(figsize=(6,4))
sns.boxplot(x='Category', y='Price', data=df_eda)
plt.title("Box Plot: Price by Category")
plt.show()

**Markdown Cell:**
```markdown
#### Grouping Data
`df.groupby()` allows you to group data based on some criteria and then apply an aggregate function (like mean, sum, count).
- Can group by a single attribute or multiple attributes.
`as_index=False` keeps the grouping keys as columns rather than setting them as the index.
```

**Code Cell 12: GroupBy Statements**

In [None]:
# Group by a single attribute ('Category') and calculate the mean of other numerical columns
df_group_single = df_eda[['Category', 'Price', 'Horsepower']] # Select relevant columns
grouped_by_category_mean = df_group_single.groupby(['Category'], as_index=False).mean()
print("\nMean Price and Horsepower grouped by Category:")
print(grouped_by_category_mean)

# Create a more diverse category for multiple grouping example
df_eda['SubCategory'] = ['S1', 'S1', 'S2', 'S1', 'S2', 'S2']

# Group by multiple attributes ('Category', 'SubCategory')
df_group_multiple = df_eda[['Category', 'SubCategory', 'Price']]
grouped_by_multi_mean = df_group_multiple.groupby(['Category', 'SubCategory'], as_index=False).mean()
print("\nMean Price grouped by Category and SubCategory:")
print(grouped_by_multi_mean)

**Markdown Cell:**
```markdown
#### Pivot Tables
Pivot tables reshape data, allowing you to summarize and aggregate it with one variable along the rows, another along the columns, and values in the cells.
```

**Code Cell 13: Pivot Tables**

In [None]:
# Build a pivot table of average Price with 'Category' as the index and 'SubCategory' as the columns.
# pivot() needs exactly one row per (index, columns) pair, so aggregate with groupby first.
# This recomputes the same averages as grouped_by_multi_mean from the previous cell.
grouped_for_pivot = df_eda.groupby(['Category', 'SubCategory'], as_index=False)['Price'].mean()
print("\nData prepared for pivot table (average price):")
print(grouped_for_pivot)

# Create the pivot table
pivot_table_result = grouped_for_pivot.pivot(index='Category', columns='SubCategory', values='Price')
print("\nPivot Table (Category as index, SubCategory as columns, Avg Price as values):")
print(pivot_table_result)
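
The group-then-pivot steps can also be done in one call with `DataFrame.pivot_table`, which aggregates duplicates itself. A sketch on the same Category, SubCategory, and Price values (the name `df_demo` is illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "A", "B"],
    "SubCategory": ["S1", "S1", "S2", "S1", "S2", "S2"],
    "Price": [10000, 15000, 12000, 20000, 9000, 17000],
})

# aggfunc="mean" averages all rows that share a (Category, SubCategory) pair
pivot_direct = df_demo.pivot_table(
    index="Category", columns="SubCategory", values="Price", aggfunc="mean"
)
print(pivot_direct)
```

Combinations that never occur in the data, such as Category C with SubCategory S2, appear as `NaN` in the result.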

**Markdown Cell:**
```markdown
#### Heatmap / Pseudocolor Plot
A heatmap visually represents matrix-like data where values are depicted by color intensity. `plt.pcolor()` can be used with pivot table data.
```

**Code Cell 14: Pseudocolor Plot (Heatmap of Pivot Table)**

In [None]:
# Using the pivot_table_result from the previous cell
plt.figure(figsize=(7,5))
plt.pcolor(pivot_table_result, cmap='RdBu') # RdBu is a Red-Blue colormap
plt.colorbar(label='Average Price') # Add a color bar to indicate values
plt.yticks(np.arange(0.5, len(pivot_table_result.index), 1), pivot_table_result.index) # Set Y-axis ticks and labels
plt.xticks(np.arange(0.5, len(pivot_table_result.columns), 1), pivot_table_result.columns) # Set X-axis ticks and labels
plt.xlabel("SubCategory")
plt.ylabel("Category")
plt.title("Heatmap of Average Price by Category and SubCategory")
plt.show()

**Markdown Cell:**
```markdown
#### Pearson Coefficient and p-value
The Pearson correlation coefficient measures the linear correlation between two continuous variables. The p-value helps determine the statistical significance of the correlation.
- `scipy.stats.pearsonr(col1, col2)` returns the Pearson coefficient and the p-value.
```

**Code Cell 15: Pearson Coefficient and p-value**

In [None]:
from scipy import stats

# Calculate Pearson correlation between 'Horsepower' and 'Price'
pearson_coef, p_value = stats.pearsonr(df_eda['Horsepower'], df_eda['Price'])

print(f"\nPearson Correlation Coefficient (Horsepower vs Price): {pearson_coef:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("The correlation is statistically significant (p < 0.05).")
else:
    print("The correlation is not statistically significant (p >= 0.05).")
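
To make the coefficient less of a black box, it can be rebuilt from its definition: the sum of products of the centered values, divided by the square root of the product of their sums of squares. The sketch below uses the Horsepower and Price values from the sample data and checks the result against `np.corrcoef`.

```python
import numpy as np

horsepower = np.array([70, 90, 80, 110, 60, 100], dtype=float)
price = np.array([10000, 15000, 12000, 20000, 9000, 17000], dtype=float)

# Center both variables around their means
hp_centered = horsepower - horsepower.mean()
price_centered = price - price.mean()

# Pearson r from its definition
r_manual = np.sum(hp_centered * price_centered) / np.sqrt(
    np.sum(hp_centered ** 2) * np.sum(price_centered ** 2)
)

r_numpy = np.corrcoef(horsepower, price)[0, 1]
print(r_manual, r_numpy)
```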

---
### 4. Model Development
This section deals with creating predictive models. The cheat sheet focuses on Linear and Polynomial Regression.

**Markdown Cell:**
```markdown
#### Linear Regression
Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation.
- **Simple Linear Regression:** One independent variable.
- **Multiple Linear Regression:** Multiple independent variables.
The `sklearn.linear_model.LinearRegression` class is used.
```

**Code Cell 16: Linear Regression - Object Creation and Training**

In [None]:
from sklearn.linear_model import LinearRegression

# Create a Linear Regression model object
lr = LinearRegression()

# Prepare data for training (using df_eda)
# Single attribute (Simple Linear Regression): Predicting Price from Horsepower
X_simple = df_eda[['Horsepower']] # Features need to be 2D (DataFrame)
Y_target = df_eda['Price']       # Target is 1D (Series)

# Train the Simple Linear Regression model
lr.fit(X_simple, Y_target)
print("Simple Linear Regression model trained.")

# Multiple attributes (Multiple Linear Regression): Predicting Price from Horsepower and EngineSize
X_multiple = df_eda[['Horsepower', 'EngineSize']]
# Y_target is the same

# Create a new lr object for multiple regression or re-train the existing one
lr_multiple = LinearRegression()
lr_multiple.fit(X_multiple, Y_target)
print("\nMultiple Linear Regression model trained.")

**Markdown Cell:**
```markdown
#### Generating Predictions and Model Parameters
- `lr.predict(X)`: Generates predictions for new input `X`.
- `lr.coef_`: Returns the slope coefficient(s) (m). For multiple regression, it's an array.
- `lr.intercept_`: Returns the intercept (c).
The linear model is defined by $Y = mX + c$ (simple) or $Y = m_1X_1 + m_2X_2 + ... + c$ (multiple).
```

**Code Cell 17: Predictions and Model Parameters (Linear Regression)**

In [None]:
# Predictions using the simple linear regression model
Y_hat_simple = lr.predict(X_simple)
print("\nPredictions from Simple Linear Regression (first 3):")
print(Y_hat_simple[:3])

# Identify coefficient and intercept for the simple linear model
coeff_simple = lr.coef_
intercept_simple = lr.intercept_
print(f"Simple LR - Coefficient (slope for Horsepower): {coeff_simple[0]:.2f}") # lr.coef_ is an array
print(f"Simple LR - Intercept: {intercept_simple:.2f}")

# Predictions using the multiple linear regression model
Y_hat_multiple = lr_multiple.predict(X_multiple)
print("\nPredictions from Multiple Linear Regression (first 3):")
print(Y_hat_multiple[:3])

# Identify coefficients and intercept for the multiple linear model
coeffs_multiple = lr_multiple.coef_
intercept_multiple = lr_multiple.intercept_
print(f"Multiple LR - Coefficients (for Horsepower, EngineSize): {coeffs_multiple}")
print(f"Multiple LR - Intercept: {intercept_multiple:.2f}")
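
As a quick check of the equation $Y = mX + c$, the predictions can be reproduced by hand from `coef_` and `intercept_`. A self-contained sketch using the Horsepower and Price values from the sample data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Horsepower (feature) and Price (target) from the sample data
X = np.array([[70], [90], [80], [110], [60], [100]], dtype=float)
y = np.array([10000, 15000, 12000, 20000, 9000, 17000], dtype=float)

model = LinearRegression().fit(X, y)

# Rebuild the predictions from the fitted slope (m) and intercept (c)
manual_predictions = model.coef_[0] * X[:, 0] + model.intercept_

print(np.allclose(manual_predictions, model.predict(X)))
```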

**Markdown Cell:**
```markdown
#### Residual Plot
A residual plot shows the residuals (differences between actual and predicted values) on the y-axis and an independent variable (or predicted values) on the x-axis.
It helps to check if the linear model assumptions are met (e.g., residuals are randomly scattered around zero). Patterns in residuals suggest the model might not be a good fit.
Seaborn's `sns.residplot()` is used.
```

**Code Cell 18: Residual Plot**

In [None]:
# Residual plot for the Simple Linear Regression (Price vs Horsepower)
plt.figure(figsize=(7,5))
# sns.residplot fits its own simple linear regression of y on x and plots the residuals;
# it does not take a fitted scikit-learn model, so lr is not passed here.
sns.residplot(x=X_simple['Horsepower'], y=Y_target) # x and y are 1D Series
plt.title("Residual Plot for Simple Linear Regression (Horsepower vs Price)")
plt.xlabel("Horsepower")
plt.ylabel("Residuals")
plt.show()

# For multiple regression, you typically plot residuals against predicted values or against each feature
# Residuals = Y_target - Y_hat_multiple
# plt.scatter(Y_hat_multiple, Y_target - Y_hat_multiple)
# plt.axhline(0, color='red', linestyle='--')
# plt.xlabel("Predicted Values")
# plt.ylabel("Residuals")
# plt.title("Residual Plot for Multiple Linear Regression (Residuals vs Predicted)")
# plt.show()

**Markdown Cell:**
```markdown
#### Distribution Plot (KDE Plot)
`sns.distplot()` (deprecated since seaborn 0.11 in favor of `sns.kdeplot()`, `sns.histplot()`, and `sns.displot()`) visualizes the distribution of a single variable.
Using `hist=False` with `distplot` or just `kdeplot` shows the Kernel Density Estimate, a smoothed representation of the distribution.
This is useful for comparing the distribution of predicted values vs. actual values.
```

**Code Cell 19: Distribution Plot (KDE Plot)**

In [None]:
# Plotting distribution of actual 'Price' and predicted 'Price' from simple LR
plt.figure(figsize=(8,5))
sns.kdeplot(Y_target, label='Actual Price', fill=True, alpha=0.5)
sns.kdeplot(Y_hat_simple, label='Predicted Price (Simple LR)', fill=True, alpha=0.5)
plt.title("Distribution of Actual vs. Predicted Prices (Simple LR)")
plt.xlabel("Price")
plt.ylabel("Density")
plt.legend()
plt.show()

# Note: sns.distplot is deprecated. sns.kdeplot or sns.histplot(kde=True) are alternatives.
# Example using kdeplot (as above) is the modern way.

**Markdown Cell:**
```markdown
#### Polynomial Regression (Single Variable)
Models a non-linear relationship using a polynomial equation of degree `n`.
NumPy's `np.polyfit(x, y, n)` fits a polynomial and returns its coefficients.
`np.poly1d(coefficients)` creates a polynomial function object from these coefficients, which can then be used for prediction.
```

**Code Cell 20: Polynomial Regression (Single Variable with NumPy)**

In [None]:
# Use the full EngineSize and Price columns as NumPy arrays
x_poly_data = df_eda['EngineSize'].values # NumPy array
y_poly_data = df_eda['Price'].values      # NumPy array

degree = 2 # Degree of the polynomial

# Fit the polynomial model
# coefficients_poly holds the polynomial coefficients, highest degree first
coefficients_poly = np.polyfit(x_poly_data, y_poly_data, degree)
print(f"\nPolynomial coefficients (degree {degree}): {coefficients_poly}")

# Create the polynomial model function
p_model = np.poly1d(coefficients_poly)
print(f"Polynomial model equation:\n{p_model}")

# Generate predicted output using the polynomial model
Y_hat_poly = p_model(x_poly_data)

# Plotting the results
plt.figure(figsize=(8,5))
plt.scatter(x_poly_data, y_poly_data, label='Actual Data')
# Sort x values for a smooth line plot
sorted_indices = np.argsort(x_poly_data)
plt.plot(x_poly_data[sorted_indices], Y_hat_poly[sorted_indices], color='red', label=f'Polynomial Fit (degree {degree})')
plt.xlabel("Engine Size")
plt.ylabel("Price")
plt.title("Single Variable Polynomial Regression")
plt.legend()
plt.show()

**Markdown Cell:**
```markdown
#### Multivariate Polynomial Regression (Scikit-learn)
Extends polynomial regression to multiple independent variables.
`sklearn.preprocessing.PolynomialFeatures(degree=n)` generates a new feature matrix consisting of all polynomial combinations of the features up to the specified degree. This transformed feature matrix can then be used with a linear regression model.
```

**Code Cell 21: Multivariate Polynomial Regression (Scikit-learn)**

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Features for multivariate polynomial regression
Z_multi_poly = df_eda[['Horsepower', 'EngineSize']]
# Y_target is the same df_eda['Price']

# Create PolynomialFeatures object
poly_features_transformer = PolynomialFeatures(degree=2, include_bias=False) # include_bias=False is common as LinearRegression handles it

# Transform the original features into polynomial features
Z_multi_poly_transformed = poly_features_transformer.fit_transform(Z_multi_poly)
print(f"\nOriginal features shape: {Z_multi_poly.shape}")
print(f"Transformed polynomial features shape (degree 2): {Z_multi_poly_transformed.shape}")
print("Feature names after transformation (if you use get_feature_names_out):")
print(poly_features_transformer.get_feature_names_out(['Horsepower', 'EngineSize']))

# Now, fit a linear regression model using these transformed features
lr_poly_multi = LinearRegression()
lr_poly_multi.fit(Z_multi_poly_transformed, Y_target)
print("\nMultivariate Polynomial Regression model trained.")

# Predict using the transformed features
# For new data, it must also be transformed using poly_features_transformer.transform()
Y_hat_poly_multi = lr_poly_multi.predict(Z_multi_poly_transformed)
print("Predictions from Multivariate Polynomial Regression (first 3):")
print(Y_hat_poly_multi[:3])

**Markdown Cell:**
```markdown
#### Pipeline
`sklearn.pipeline.Pipeline` allows chaining multiple processing steps (e.g., scaling, polynomial feature creation, model fitting) into a single estimator.
This simplifies workflows, helps prevent data leakage (preprocessing steps are fit only on the data passed to `fit`, so test data never influences them), and makes grid search over parameters of different steps easier.
Each step is a tuple of `('name', estimator_object)`.
```

**Code Cell 22: Pipeline**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler # For feature scaling

# Define the steps in the pipeline
# 1. Scale data (StandardScaler)
# 2. Create polynomial features (PolynomialFeatures)
# 3. Fit a linear regression model (LinearRegression)
Input_pipeline_steps = [
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression())
]

# Create the pipeline object
pipe = Pipeline(Input_pipeline_steps)

# Data for the pipeline (using Z_multi_poly and Y_target from previous examples)
# Ensure Z is float (StandardScaler might require it)
Z_pipe_data = Z_multi_poly.astype(float)
# Y_target is already appropriate

# Fit the entire pipeline to the data
pipe.fit(Z_pipe_data, Y_target)
print("\nPipeline trained.")

# Make predictions using the pipeline
# The pipeline automatically applies all transformations before prediction
Y_pipe_predictions = pipe.predict(Z_pipe_data)
print("Predictions from Pipeline (first 3):")
print(Y_pipe_predictions[:3])
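
The step names given in the tuples are not just labels: they give access to each fitted step through `named_steps`. A self-contained sketch with the Horsepower and EngineSize values from the sample data (the name `pipe_demo` is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Columns: Horsepower, EngineSize; target: Price
Z = np.array([[70, 100], [90, 120], [80, 110], [110, 150], [60, 90], [100, 130]], dtype=float)
y = np.array([10000, 15000, 12000, 20000, 9000, 17000], dtype=float)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
]).fit(Z, y)

# Look up fitted steps by the names given in the tuples
scaler_step = pipe_demo.named_steps["scale"]
model_step = pipe_demo.named_steps["model"]

print(scaler_step.mean_)       # per-feature means learned during fit
print(model_step.coef_.shape)  # one coefficient per polynomial feature
```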

**Markdown Cell:**
```markdown
#### R² value (Coefficient of Determination)
R² measures how well the regression model fits the observed data. It represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- At most 1; 0 corresponds to always predicting the mean of the target, and the value can be negative for models that fit worse than that. Higher is generally better.
- For Scikit-learn linear models, `model.score(X, Y)` directly returns R².
- For NumPy polynomial models, `sklearn.metrics.r2_score(y_true, y_predicted)` can be used.
```

**Code Cell 23: R² Value**

In [None]:
from sklearn.metrics import r2_score

# R² for Simple Linear Regression (using lr trained earlier)
r2_score_simple_lr = lr.score(X_simple, Y_target)
print(f"\nR² for Simple Linear Regression (Horsepower vs Price): {r2_score_simple_lr:.4f}")

# R² for Multiple Linear Regression (using lr_multiple)
r2_score_multiple_lr = lr_multiple.score(X_multiple, Y_target)
print(f"R² for Multiple Linear Regression: {r2_score_multiple_lr:.4f}")

# R² for Single Variable Polynomial Regression (using NumPy model p_model and its predictions Y_hat_poly)
# y_poly_data is the true y, Y_hat_poly is the predicted y
r2_score_numpy_poly = r2_score(y_poly_data, Y_hat_poly)
print(f"R² for NumPy Polynomial Regression (degree 2): {r2_score_numpy_poly:.4f}")

# R² for Multivariate Polynomial Regression (using lr_poly_multi and Z_multi_poly_transformed)
r2_score_sklearn_poly_multi = lr_poly_multi.score(Z_multi_poly_transformed, Y_target)
print(f"R² for Scikit-learn Multivariate Polynomial Regression: {r2_score_sklearn_poly_multi:.4f}")

# R² for Pipeline model
r2_score_pipeline = pipe.score(Z_pipe_data, Y_target)
print(f"R² for Pipeline model: {r2_score_pipeline:.4f}")
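
The "proportion of variance explained" reading follows from the definition $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand for a straight-line fit of Price on Horsepower (obtained with `np.polyfit` purely to have some predictions) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

x = np.array([70, 90, 80, 110, 60, 100], dtype=float)
y_true = np.array([10000, 15000, 12000, 20000, 9000, 17000], dtype=float)

# Degree-1 least-squares fit, used only to produce predictions
slope, intercept = np.polyfit(x, y_true, 1)
y_pred = slope * x + intercept

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```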

**Markdown Cell:**
```markdown
#### MSE value (Mean Squared Error)
MSE measures the average of the squares of the errors (the difference between actual and estimated values).
Lower MSE indicates a better fit.
`sklearn.metrics.mean_squared_error(y_true, y_predicted)` is used.
```

**Code Cell 24: MSE Value**

In [None]:
from sklearn.metrics import mean_squared_error

# MSE for Simple Linear Regression
mse_simple_lr = mean_squared_error(Y_target, Y_hat_simple)
print(f"\nMSE for Simple Linear Regression: {mse_simple_lr:.2f}")

# MSE for Multiple Linear Regression
mse_multiple_lr = mean_squared_error(Y_target, Y_hat_multiple)
print(f"MSE for Multiple Linear Regression: {mse_multiple_lr:.2f}")

# MSE for NumPy Polynomial Regression
mse_numpy_poly = mean_squared_error(y_poly_data, Y_hat_poly)
print(f"MSE for NumPy Polynomial Regression: {mse_numpy_poly:.2f}")

# MSE for Scikit-learn Multivariate Polynomial Regression
mse_sklearn_poly_multi = mean_squared_error(Y_target, Y_hat_poly_multi)
print(f"MSE for Scikit-learn Multivariate Polynomial Regression: {mse_sklearn_poly_multi:.2f}")

# MSE for Pipeline model
mse_pipeline = mean_squared_error(Y_target, Y_pipe_predictions)
print(f"MSE for Pipeline model: {mse_pipeline:.2f}")
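
Because MSE is in squared units of the target, its square root (RMSE) is often reported alongside it. A small sketch with made-up numbers that also computes MSE by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0, -2, so the squared errors are 1, 0, 4
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse_manual)  # back in the units of the target

print(mse_manual, mean_squared_error(y_true, y_pred), rmse)
```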

---
### 5. Model Evaluation and Refinement
This involves assessing model performance more robustly and tuning models.

**Markdown Cell:**
```markdown
#### Splitting Data for Training and Testing
It's crucial to evaluate a model on data it hasn't seen during training.
`sklearn.model_selection.train_test_split` splits data into random train and test subsets.
- `test_size`: Proportion of the dataset to include in the test split.
- `random_state`: Ensures reproducibility of the split.
```

**Code Cell 25: Train/Test Split**

In [None]:
from sklearn.model_selection import train_test_split

# Using df_eda as an example dataset
# Let's assume 'Price' is the target and all other numeric columns are features
X_data_full = df_eda[['EngineSize', 'Horsepower']] # Features
Y_data_full = df_eda['Price']                     # Target

# Split data: 20% for testing, 80% for training; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(X_data_full, Y_data_full, test_size=0.20, random_state=42) # with only 6 rows this leaves 2 rows for testing

print(f"\nShape of x_train: {x_train.shape}, y_train: {y_train.shape}")
print(f"Shape of x_test: {x_test.shape}, y_test: {y_test.shape}")

# Now you would train your model on x_train, y_train
# And evaluate it on x_test, y_test
lr_eval = LinearRegression()
lr_eval.fit(x_train, y_train)
r2_test = lr_eval.score(x_test, y_test)
print(f"R² score on the test set: {r2_test:.4f}")
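
The effect of `test_size` and `random_state` is easiest to see on synthetic data. In the sketch below the arrays are placeholders generated with `np.arange`, not real measurements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)

# Two splits with the same random_state select the same rows
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr1.shape, X_te1.shape)      # 8 training rows, 2 test rows
print(np.array_equal(y_te1, y_te2))  # the split is reproducible
```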

**Markdown Cell:**
```markdown
#### Cross-Validation Score
Cross-validation provides a more robust measure of model performance by training and testing the model on different subsets (folds) of the data.
`sklearn.model_selection.cross_val_score` performs K-fold cross-validation.
- `cv=n`: Number of folds.
It returns an array of scores (e.g., R²) for each fold. The mean and standard deviation of these scores give an overall performance estimate.
```

**Code Cell 26: Cross-Validation Score**

In [None]:
from sklearn.model_selection import cross_val_score

lre_cv = LinearRegression() # Model instance

# Perform 3-fold cross-validation on a single feature ('EngineSize') to keep the example simple
# With no scoring argument, cross_val_score uses the estimator's own score method (R² for LinearRegression)
rcross_scores = cross_val_score(lre_cv, X_data_full[['EngineSize']], Y_data_full, cv=3) # cv=3 folds

print(f"\nCross-validation R² scores for each fold: {rcross_scores}")
print(f"Mean R² from cross-validation: {rcross_scores.mean():.4f}")
print(f"Standard deviation of R² from cross-validation: {rcross_scores.std():.4f}")
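
One detail worth knowing: with an integer `cv` and a regressor, scikit-learn splits the rows into consecutive folds without shuffling. If the rows happen to be ordered (for example, sorted by the feature), the per-fold scores can look quite different from those of a shuffled split. The sketch below illustrates this with synthetic data that is sorted on purpose; the data and variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, deliberately sorted by the feature (hypothetical values)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(1.0, 5.0, size=60))
X_sorted = x.reshape(-1, 1)
y_sorted = 3000 + 4000 * x + rng.normal(0, 1500, size=60)

model = LinearRegression()

# cv=3 on a regressor: consecutive, unshuffled folds
scores_ordered = cross_val_score(model, X_sorted, y_sorted, cv=3)

# Explicit KFold that shuffles the rows before splitting
kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores_shuffled = cross_val_score(model, X_sorted, y_sorted, cv=kf)

print("Unshuffled folds:", np.round(scores_ordered, 3))
print("Shuffled folds:  ", np.round(scores_shuffled, 3))
```

Passing a `KFold` object instead of an integer is one way to control shuffling and reproducibility; whether shuffling is appropriate depends on the data (it is usually not for time series).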

**Markdown Cell:**
```markdown
#### Cross-Validation Prediction
`sklearn.model_selection.cross_val_predict` generates a prediction for every data point, each made by a model trained on the folds that do not contain that point.
This is useful for visualizing model performance or for creating out-of-sample predictions for further analysis.
```

**Code Cell 27: Cross-Validation Prediction**

In [None]:
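from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score # Repeated here so the cell does not rely on an earlier import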

lre_cv_pred = LinearRegression()

# Get cross-validated predictions
# Using X_data_full[['EngineSize']] as an example with one feature
yhat_cv = cross_val_predict(lre_cv_pred, X_data_full[['EngineSize']], Y_data_full, cv=3)

print("\nFirst 5 cross-validated predictions:")
print(yhat_cv[:5])

# These predictions can be compared to Y_data_full to assess out-of-sample performance
r2_cv_pred = r2_score(Y_data_full, yhat_cv)
print(f"R² score using cross-validated predictions: {r2_cv_pred:.4f}")
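
A single R² computed from pooled cross-validated predictions is not the same quantity as the mean of the per-fold scores from `cross_val_score`, so the two numbers usually differ slightly. The sketch below compares them on synthetic data (hypothetical values) using the same folds for both functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic data (hypothetical values)
rng = np.random.default_rng(2)
X_cv = rng.uniform(1.0, 5.0, size=(45, 1))
y_cv = 2000 + 5000 * X_cv[:, 0] + rng.normal(0, 2000, size=45)

# Reuse the same fold definition for both functions
kf = KFold(n_splits=3, shuffle=True, random_state=0)
model = LinearRegression()

fold_scores = cross_val_score(model, X_cv, y_cv, cv=kf)  # one R² per fold
oof_pred = cross_val_predict(model, X_cv, y_cv, cv=kf)   # one prediction per row

pooled_r2 = r2_score(y_cv, oof_pred)
print(f"Mean of per-fold R²: {fold_scores.mean():.4f}")
print(f"R² on pooled out-of-fold predictions: {pooled_r2:.4f}")
print("One prediction per row:", oof_pred.shape == y_cv.shape)
```

The per-fold scores are the more conventional way to report generalization performance; the pooled predictions are mainly handy for plots such as predicted versus actual values.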

**Markdown Cell:**
```markdown
#### Ridge Regression
Ridge Regression is a linear regression model that includes L2 regularization.
Regularization adds a penalty term to the loss function to shrink coefficient magnitudes, helping to prevent overfitting, especially when dealing with multicollinearity or many features.
- `alpha`: The regularization strength. Higher alpha means stronger regularization (more shrinkage).

It's often used with polynomial features to control their complexity.
```

**Code Cell 28: Ridge Regression**

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures # Repeated here so the cell does not rely on an earlier import

# Reuses x_train, x_test, y_train, y_test from the train/test split above
# and fits Ridge regression on polynomial features

# Create polynomial features
poly_ridge_transformer = PolynomialFeatures(degree=2, include_bias=False)
x_train_poly_ridge = poly_ridge_transformer.fit_transform(x_train)
x_test_poly_ridge = poly_ridge_transformer.transform(x_test) # Use transform on test data, not fit_transform

# Create and train Ridge Regression model
# Alpha is the regularization strength; larger values specify stronger regularization.
ridge_model = Ridge(alpha=1.0) # Common default for alpha
ridge_model.fit(x_train_poly_ridge, y_train)
print("\nRidge Regression model trained with polynomial features.")

# Make predictions
yhat_ridge = ridge_model.predict(x_test_poly_ridge)
print("First 3 predictions from Ridge Regression:")
print(yhat_ridge[:3])

# Evaluate the Ridge model
r2_ridge = ridge_model.score(x_test_poly_ridge, y_test)
print(f"R² score for Ridge Regression on test set: {r2_ridge:.4f}")
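
To see what "more shrinkage" means in practice, the following sketch fits Ridge models with increasing `alpha` on the same polynomial features and prints the size (L2 norm) of the coefficient vector. The data is synthetic, and the standardization step is an addition for this illustration (the cell above does not scale its features); it is included because the L2 penalty treats all coefficients alike, so features on very different scales are penalized unevenly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical values)
rng = np.random.default_rng(3)
X_raw = rng.uniform(-2, 2, size=(40, 2))
y_r = 1.5 * X_raw[:, 0] - 2.0 * X_raw[:, 1] ** 2 + rng.normal(0, 0.5, size=40)

# Degree-3 polynomial features, then standardized so the penalty
# acts on comparable scales
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_raw)
X_poly = StandardScaler().fit_transform(X_poly)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for a in alphas:
    m = Ridge(alpha=a).fit(X_poly, y_r)
    coef_norms.append(float(np.linalg.norm(m.coef_)))
    print(f"alpha={a:>8}: ||coef|| = {coef_norms[-1]:.4f}")
```

The norm should fall as `alpha` grows. A very large `alpha` pushes the coefficients toward zero, so the model approaches predicting the mean of the target, which is underfitting rather than a fix for overfitting.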

**Markdown Cell:**
```markdown
#### Grid Search
`sklearn.model_selection.GridSearchCV` systematically searches for the best combination of hyperparameter values for a model.
It tries all specified combinations using cross-validation and selects the one that performs best on average.
- `param_grid`: A dictionary or list of dictionaries where keys are parameter names and values are lists of settings to try.
- `cv`: Number of cross-validation folds.

`.best_estimator_` gives the model with the best parameters found, refit on all the data passed to `fit` (the default `refit=True` behavior).
```

**Code Cell 29: Grid Search for Ridge Regression Alpha**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Ridge's alpha
# These are the alpha values that GridSearchCV will test
parameters_grid = [{'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}]

# Create a Ridge regression object
RR_grid = Ridge()

# Create GridSearchCV object
# It will search for the best alpha from parameters_grid using 4-fold CV
Grid1 = GridSearchCV(RR_grid, parameters_grid, cv=4, scoring='r2') # Can specify scoring metric

# Fit GridSearchCV on the polynomial-transformed training data from the Ridge example.
# The test set stays out of the search so it can give an unbiased final score.
Grid1.fit(x_train_poly_ridge, y_train) # Using the polynomial features from Ridge example

print("\nGridSearchCV fitting completed.")

# Get the best estimator (model with best parameters) found by GridSearchCV
BestRR = Grid1.best_estimator_
print(f"Best alpha found by GridSearchCV: {BestRR.alpha}")

# Evaluate the best model on the (polynomial transformed) test set
r2_score_best_ridge = BestRR.score(x_test_poly_ridge, y_test)
print(f"R² score of the best Ridge model on test set: {r2_score_best_ridge:.4f}")
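
A common variation is to put the preprocessing steps and the model into one `Pipeline` and search over that, so that any step that learns from data (such as scaling) is refit inside each cross-validation fold. The sketch below assumes synthetic data and hypothetical step names (`poly`, `scale`, `ridge`); the `ridge__alpha` key follows scikit-learn's `<step name>__<parameter>` convention.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical values)
rng = np.random.default_rng(4)
X_g = rng.uniform(-2, 2, size=(80, 2))
y_g = 1.5 * X_g[:, 0] - 2.0 * X_g[:, 1] ** 2 + rng.normal(0, 0.5, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X_g, y_g, test_size=0.2, random_state=42)

# Step names are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Parameters of a pipeline step are addressed as "<step>__<parameter>"
param_grid = {"ridge__alpha": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=4, scoring="r2")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best mean CV R²: {search.best_score_:.4f}")
print(f"Test-set R² of refit pipeline: {search.score(X_te, y_te):.4f}")
```

`search.cv_results_` holds the mean and standard deviation of the score for every candidate, which helps when judging whether the best `alpha` is clearly better than its neighbors or only marginally so.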

---
### Practice Questions (50)

**Data Loading**

1.  What Pandas function is used to read a CSV file?
2.  How do you read a CSV file that does not contain a header row, assigning default integer column names?
3.  How can you specify custom column names when reading a CSV file that has no header?
4.  What does `df.head(7)` do?
5.  What does `df.tail()` return by default?
6.  How can you get a list of all column names in a DataFrame `df`?
7.  How would you change all column names of `df` to a new list `new_headers`?
8.  How do you replace all occurrences of the string "-" with `np.nan` in a DataFrame `df`?
9.  What information does `df.dtypes` provide?
10. What is the difference between `df.describe()` and `df.describe(include="all")`?
11. What does `df.info()` show?
12. How do you save a DataFrame `df` to a CSV file named 'output.csv' without writing the index?

**Data Wrangling**

13. How can you replace missing values (`np.nan`) in a column 'Age' with its median?
14. How do you find the most frequent value in a categorical column 'Category'?
15. How do you convert a column 'PriceString' (containing numbers as strings) to a float data type?
16. What is the purpose of data normalization? Explain one simple normalization technique.
17. How can you use `np.linspace` and `pd.cut` to bin a numerical column 'Score' into 5 equal-width bins?
18. How would you rename a column 'OldName' to 'NewName' in DataFrame `df` permanently?
19. What are indicator variables (one-hot encoding), and why are they useful?
20. What Pandas function creates one-hot encoded columns from a categorical column?

**Exploratory Data Analysis (EDA)**

21. How do you compute the pairwise correlation of all numerical columns in a DataFrame `df`?
22. How would you get the correlation only between columns 'A' and 'B' from `df`?
23. What type of plot is best for visualizing the relationship between two numerical variables to see if they cluster or follow a trend?
24. What does `sns.regplot(x='feature', y='target', data=df)` show?
25. What information does a box plot (`sns.boxplot`) convey about a variable's distribution?
26. How can you group a DataFrame `df` by a column 'Region' and then calculate the average 'Sales' for each region?
27. What is a pivot table, and what is it used for in data analysis?
28. How can you visually represent a pivot table's data using colors (like a heatmap)?
29. What two values does `scipy.stats.pearsonr()` return?
30. If `pearsonr()` returns a p-value of 0.001 for two variables, what does this suggest about their linear correlation?

**Model Development**

31. What Scikit-learn class is used for creating a linear regression model?
32. How do you train a linear regression model `lr` using features `X` and target `Y`?
33. After training, how do you get the predicted values `Y_hat` for input features `X_test` using model `lr`?
34. What attributes of a fitted `LinearRegression` object store the slope coefficient(s) and the intercept?
35. What is a residual plot, and what does a random scatter of points around zero in a residual plot indicate?
36. Which Seaborn function is primarily recommended now for plotting a Kernel Density Estimate of a variable (replacing `distplot` for this specific purpose)?
37. How can you perform single-variable polynomial regression of degree 3 for `x` and `y` using NumPy functions?
38. What is the role of `sklearn.preprocessing.PolynomialFeatures`?
39. What is a Scikit-learn `Pipeline` and what is one of its key benefits?
40. Name three common steps that might be included in a machine learning `Pipeline`.

**Model Evaluation and Refinement**

41. What is the purpose of splitting data into training and testing sets? Which Scikit-learn function is used for this?
42. What does `test_size=0.3` signify in `train_test_split`?
43. Explain K-fold cross-validation. What Scikit-learn function performs this and returns scores for each fold?
44. If `cross_val_score` returns `[0.8, 0.85, 0.75]`, what do these values represent, and how would you get an overall performance measure?
45. What does `cross_val_predict` return, and how is it different from `cross_val_score`?
46. What is Ridge Regression, and what problem does its `alpha` parameter help address?
47. How does a larger `alpha` value affect a Ridge Regression model?
48. What is `GridSearchCV` used for in model tuning?
49. What does the `.best_estimator_` attribute of a fitted `GridSearchCV` object contain?
50. What does R² (Coefficient of Determination) measure? What does an MSE (Mean Squared Error) of 0 indicate?
