1. What does R-squared represent in a regression model?

R-squared: This represents the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the regression model fits the data. A higher R-squared value means a better fit.
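
The definition above can be sketched numerically. This is a minimal illustration, assuming the usual formula R-squared = 1 - SS_res / SS_tot; the helper name `r_squared` and the sample values are hypothetical and not part of the original notes.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and two sets of predictions
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
perfect = y_obs.copy()                          # predictions match exactly
mean_only = np.full_like(y_obs, y_obs.mean())   # always predict the mean

print(r_squared(y_obs, perfect))    # 1.0
print(r_squared(y_obs, mean_only))  # 0.0
```

A model that only predicts the mean scores 0, and a perfect model scores 1, which is why the value reads as a proportion of variance explained.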
2. What are the assumptions of linear regression?
Assumptions of Linear Regression:

Linearity: The relationship between the independent and dependent variables is linear.

Independence: Observations are independent of each other.

Homoscedasticity: The variance of residuals is constant across all levels of the independent variable.

Normality: Residuals (errors) of the model are normally distributed.



3. What is the difference between R-squared and Adjusted R-squared?

R-squared vs. Adjusted R-squared: R-squared measures the proportion of variance explained by the model. Adjusted R-squared adjusts this value based on the number of predictors, correcting for the model's complexity.



4. Why do we use Mean Squared Error (MSE)?   
Mean Squared Error (MSE): This measures the average squared difference between observed and predicted values. It helps evaluate the accuracy of a model by penalizing larger errors more heavily.


 5. What does an Adjusted R-squared value of 0.85 indicate?
 Adjusted R-squared value of 0.85: Indicates that approximately 85% of the variance in the dependent variable is explained by the independent variables, adjusted for the number of predictors in the model.

6. How do we check for normality of residuals in linear regression?
Checking Normality of Residuals: Use visual tools like Q-Q plots or statistical tests like the Shapiro-Wilk test. The residuals should be approximately normally distributed.
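
Both checks can be run without drawing a plot. The sketch below assumes SciPy is available and uses simulated residuals (the arrays `normal_resid` and `skewed_resid` are made up for illustration). `scipy.stats.shapiro` returns the test statistic and p-value, and `scipy.stats.probplot` returns the Q-Q points together with a correlation `r` that is close to 1 when the points follow the reference line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated residuals: one roughly normal, one clearly right-skewed
normal_resid = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_resid = rng.exponential(scale=1.0, size=500) - 1.0

results = {}
for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    w_stat, p_value = stats.shapiro(resid)
    # probplot returns ((theoretical, ordered), (slope, intercept, r))
    _, (slope, intercept, r) = stats.probplot(resid, dist="norm")
    results[name] = (w_stat, p_value, r)
    print(f"{name}: Shapiro-Wilk W={w_stat:.3f}, p={p_value:.4f}, Q-Q r={r:.3f}")
```

A small p-value is evidence against normality. With large samples the test flags even minor departures, so it is usually read alongside the Q-Q plot rather than on its own.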

7. What is multicollinearity, and how does it impact regression?
Multicollinearity: This occurs when independent variables in a regression model are highly correlated with one another, which makes the coefficient estimates unstable and hard to interpret. It inflates the variance of the coefficient estimates.

 8. What is Mean Absolute Error (MAE)?
 Mean Absolute Error (MAE): This measures the average absolute difference between observed and predicted values. Unlike MSE, it does not penalize larger errors more heavily.

9. What are the benefits of using an ML pipeline?
Benefits of Using an ML Pipeline:

Streamlines the workflow by integrating data preprocessing, model training, and evaluation.

Enhances reproducibility and consistency.

Simplifies model management and deployment.

 10. Why is RMSE considered more interpretable than MSE?

 RMSE vs. MSE: RMSE (Root Mean Squared Error) is the square root of MSE and is in the same units as the dependent variable. It is considered more interpretable because it directly relates to the scale of the data.

11. What is pickling in Python, and how is it useful in ML?

Pickling in Python: Pickling serializes Python objects into a byte stream, allowing them to be saved to a file and later loaded back. It is useful in ML for saving trained models and other objects.









12. What does a high R-squared value mean?
High R-squared Value: Indicates that a large proportion of the variance in the dependent variable is explained by the independent variables in the model, suggesting a good fit.




 13. What happens if linear regression assumptions are violated?
 Violating Linear Regression Assumptions: Can lead to biased estimates, inefficiency, and invalid inference. For example, if residuals are not normally distributed, hypothesis tests may not be valid.



  14. How can we address multicollinearity in regression?

  Addressing Multicollinearity:

Remove or combine highly correlated predictors.

Use regularization techniques like Ridge or Lasso regression.

Apply Principal Component Analysis (PCA) to reduce dimensionality.
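
As a rough illustration of the regularization option, the sketch below builds two nearly identical predictors and compares ordinary least squares with Ridge. The data are synthetic and `alpha=1.0` is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Correlation between x1 and x2:", np.corrcoef(x1, x2)[0, 1])
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

With predictors this correlated, the OLS coefficients can swing to large offsetting values, while the Ridge penalty shrinks them toward a smaller, more stable pair.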


  
   15. How can feature selection improve model performance in regression analysis?

   Feature Selection in Regression Analysis: Improves model performance by removing irrelevant or redundant predictors, reducing overfitting, and enhancing model interpretability.
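
One simple way to do this is univariate selection. The sketch below assumes scikit-learn's `SelectKBest` with the `f_regression` score. The data are synthetic, with only the first two of five columns related to the target, and `k=2` is chosen to match that setup.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only columns 0 and 1 influence the target; columns 2-4 are pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)

print("Selected columns:", selected)
print("Reduced shape:", X_selected.shape)
```

Univariate scores ignore interactions between predictors, so this is a starting point rather than a complete selection strategy.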



16. How is Adjusted R-squared calculated?
Adjusted R-squared Calculation:

$$
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
$$

where $n$ is the number of observations and $k$ is the number of predictors.
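
The formula translates directly into a small function. The name `adjusted_r_squared` and the example numbers are hypothetical and chosen only to show how the penalty grows with the number of predictors.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same raw R-squared, different numbers of predictors
few = adjusted_r_squared(0.90, n=100, k=5)
many = adjusted_r_squared(0.90, n=100, k=40)

print(round(few, 3))   # 0.895
print(round(many, 3))  # 0.832
```

The raw R-squared is identical in both calls, but the model with 40 predictors is adjusted further down.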



17. Why is MSE sensitive to outliers?
MSE Sensitivity to Outliers: MSE squares the errors, which means larger errors have a disproportionately higher impact, making it sensitive to outliers.
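
A small made-up example shows the effect. Every prediction is off by 1, and then a single prediction is changed to be off by 21.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred_clean = y_true + 1.0             # every prediction is off by 1
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 21.0  # one prediction is off by 21

print("Clean:   MSE =", mse(y_true, y_pred_clean), " MAE =", mae(y_true, y_pred_clean))
print("Outlier: MSE =", mse(y_true, y_pred_outlier), " MAE =", mae(y_true, y_pred_outlier))
```

The single bad prediction raises MAE from 1 to 5 but raises MSE from 1 to 89, because its error enters as 21 squared.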



 18. What is the role of homoscedasticity in linear regression?
 Homoscedasticity in Linear Regression: Assumes that the variance of residuals is constant across levels of the independent variable. If violated (heteroscedasticity), it can affect the efficiency and validity of the model.
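
An informal way to see the difference is to compare the residual spread at low and high values of the predictor. The sketch below uses synthetic data and a hypothetical helper, `spread_ratio`. It is a rough diagnostic under these assumptions, not a formal test such as Breusch-Pagan.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(1.0, 10.0, size=400))

# Two hypothetical targets with the same trend but different noise structure
y_const = 2.0 * x + rng.normal(scale=2.0, size=400)        # constant spread
y_growing = 2.0 * x + rng.normal(scale=0.5 * x, size=400)  # spread grows with x

def spread_ratio(x, y):
    """Fit OLS, then return std of residuals in the upper half of x over the lower half."""
    features = x.reshape(-1, 1)
    resid = y - LinearRegression().fit(features, y).predict(features)
    half = len(x) // 2
    return resid[half:].std() / resid[:half].std()

ratio_const = spread_ratio(x, y_const)
ratio_growing = spread_ratio(x, y_growing)
print(f"Constant-variance data: ratio = {ratio_const:.2f}")
print(f"Growing-variance data:  ratio = {ratio_growing:.2f}")
```

A ratio near 1 is consistent with homoscedasticity, and a ratio well above 1 suggests the residual variance increases with the predictor.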



 19. What is Root Mean Squared Error (RMSE)?
 Root Mean Squared Error (RMSE): The square root of the mean squared residual. It summarizes the typical magnitude of the residuals in the units of the dependent variable, giving an indication of the absolute fit of the model to the data.




  20. Why is pickling considered risky?
  Risks of Pickling: Security risks if loading pickled data from untrusted sources, as it can execute arbitrary code during deserialization.
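
The mechanism can be shown with a harmless example. A class can define `__reduce__` to tell pickle which callable to invoke when the data is loaded. The class `Payload` below is hypothetical and calls the built-in `sorted`, but an attacker could name any importable function instead.

```python
import pickle

class Payload:
    """Hypothetical class whose __reduce__ chooses what runs on load."""
    def __reduce__(self):
        # On unpickling, pickle will call sorted([3, 1, 2])
        return (sorted, ([3, 1, 2],))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)

# The loader did not rebuild a Payload; it ran the callable named in the byte stream
print(type(result).__name__, result)
```

Because the byte stream decides what gets called, `pickle.loads` should only be used on data from a source you trust.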


  
  21. What alternatives exist to pickling for saving ML models?
  Alternatives to Pickling for Saving ML Models:

Joblib: Efficiently saves large data objects.

HDF5: A file format and set of tools for storing complex data.

ONNX: Open Neural Network Exchange format for interoperability.


  
   22. What is heteroscedasticity, and why is it a problem?
   Heteroscedasticity: When the variance of residuals is not constant. It can lead to inefficient estimates and incorrect conclusions in hypothesis testing.

    23. How can interaction terms enhance a regression model's predictive power?
    Interaction Terms: Allow the effect of one predictor on the target to depend on the value of another predictor. This lets the model capture relationships that additive terms alone would miss, which can improve predictive power.

In [None]:
##practical ##question

1. Write a Python script to visualize the distribution of errors (residuals) for a multiple linear regression model using Seaborn's "diamonds" dataset.


2. Write a Python script to calculate and print Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for a linear regression model.





In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Select relevant features and target
X = diamonds[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = diamonds['price']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the test set
y_pred = model.predict(X_test)

# Calculate residuals
residuals = y_test - y_pred

# Plot the distribution of residuals
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Distribution of Residuals')
plt.show()


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
# Calculate RMSE
rmse = np.sqrt(mse)

print(f'MSE: {mse}')
print(f'MAE: {mae}')
print(f'RMSE: {rmse}')


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Scatter plot to check linearity
for feature in X_train.columns:
    plt.scatter(X_train[feature], y_train)
    plt.xlabel(feature)
    plt.ylabel('Price')
    plt.title(f'Linearity check: {feature} vs. Price')
    plt.show()

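# Residuals vs. predicted values for homoscedasticity
# (sns.residplot would refit a regression of its y on x, so plot the residuals directly)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')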
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted values')
plt.show()

# Correlation matrix for multicollinearity
corr_matrix = X_train.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

# Create a pipeline with feature scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Evaluate the linear regression model
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
print(f'Linear Regression R^2: {np.mean(scores)}')

# Evaluate the Ridge regression model
pipeline.set_params(regressor=Ridge())
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
print(f'Ridge Regression R^2: {np.mean(scores)}')

# Evaluate the Lasso regression model
pipeline.set_params(regressor=Lasso())
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
print(f'Lasso Regression R^2: {np.mean(scores)}')



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Print coefficients, intercept, and R-squared score
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')
print(f'R-squared: {model.score(X_test, y_test)}')


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load the tips dataset
tips = sns.load_dataset('tips')

# Define the feature and target
X = tips[['total_bill']]
y = tips['tip']

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the values
y_pred = model.predict(X)

# Plot the results
plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Total Bill vs. Tip')
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X + np.random.randn(100, 1) * 2

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Predict new values
y_pred = model.predict(X)

# Plot the data points and regression line
plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Synthetic Data: Linear Regression')
plt.show()


In [None]:
import pickle
from sklearn.linear_model import LinearRegression

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Save the model to a file
with open('linear_regression_model.pkl', 'wb') as file:
    pickle.dump(model, file)


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X**2 + np.random.randn(100, 1) * 10

# Transform features to polynomial
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit the model
model = LinearRegression()
model.fit(X_poly, y)

# Predict the values
y_pred = model.predict(X_poly)

# Plot the results (sort by X so the fitted curve is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y)
plt.plot(X[order], y_pred[order], color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression (Degree 2)')
plt.show()


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X + np.random.randn(100, 1) * 2

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Print the model's coefficient and intercept
print(f'Coefficient: {model.coef_[0][0]}')
print(f'Intercept: {model.intercept_[0]}')


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X**2 + np.random.randn(100, 1) * 10

# Fit polynomial regression models of different degrees
degrees = [1, 2, 3, 4]
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    model.fit(X_poly, y)
    y_pred = model.predict(X_poly)
    mse = mean_squared_error(y, y_pred)
    print(f'Degree {degree} - MSE: {mse}')


In [None]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Select relevant features and target
X = diamonds[['carat', 'depth']]
y = diamonds['price']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression

14. Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.



15. Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.

16. Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.

In [None]:
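import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_diabetes

# Load dataset (load_boston was removed in scikit-learn 1.2;
# the bundled diabetes dataset is used here as a stand-in)
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add an intercept column so each VIF comes from a regression with a constant
X_const = sm.add_constant(df)

# Calculate VIF for each feature (column 0 is the constant, so skip it)
vif_data = pd.DataFrame()
vif_data['feature'] = df.columns
vif_data['VIF'] = [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])]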

print(vif_data)


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X**4 - 3 * X**3 + 2 * X**2 + X + np.random.randn(100, 1) * 10

# Transform features to polynomial (degree 4)
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

# Fit the model
model = LinearRegression()
model.fit(X_poly, y)

# Predict the values
y_pred = model.predict(X_poly)

# Plot the results (sort by X so the fitted curve is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y)
plt.plot(X[order], y_pred[order], color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression (Degree 4)')
plt.show()


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load dataset (using Seaborn's diamonds dataset as an example)
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
X = diamonds[['carat', 'depth', 'table', 'x', 'y', 'z']]
y = diamonds['price']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Print the R-squared score
r_squared = pipeline.score(X_test, y_test)
print(f'R-squared: {r_squared}')


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X**3 - 3 * X**2 + 2 * X + np.random.randn(100, 1) * 10

# Transform features to polynomial (degree 3)
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Fit the model
model = LinearRegression()
model.fit(X_poly, y)

# Predict the values
y_pred = model.predict(X_poly)

# Plot the results (sort by X so the fitted curve is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y)
plt.plot(X[order], y_pred[order], color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression (Degree 3)')
plt.show()


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 5)
y = 3 + 2*X[:,0] + 4*X[:,1] + 5*X[:,2] + 6*X[:,3] + 7*X[:,4] + np.random.randn(100)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Print R-squared score and coefficients
print(f'R-squared: {model.score(X_test, y_test)}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X + np.random.randn(100, 1) * 2

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Predict new values
y_pred = model.predict(X)

# Plot the data points and regression line
plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Synthetic Data: Linear Regression')
plt.show()


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 3)
y = 3 + 2*X[:,0] + 4*X[:,1] + 5*X[:,2] + np.random.randn(100)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Print R-squared score and coefficients
print(f'R-squared: {model.score(X_test, y_test)}')
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')


In [None]:
import joblib
from sklearn.linear_model import LinearRegression

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Save the model to a file using joblib
joblib.dump(model, 'linear_regression_model.joblib')

# Load the model from the file
loaded_model = joblib.load('linear_regression_model.joblib')

# Print the loaded model's coefficients and intercept
print(f'Coefficients: {loaded_model.coef_}')
print(f'Intercept: {loaded_model.intercept_}')


In [None]:
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Load the tips dataset
tips = sns.load_dataset('tips')

# One-hot encode categorical features
encoder = OneHotEncoder(drop='first')
encoded_features = encoder.fit_transform(tips[['sex', 'smoker', 'day', 'time']]).toarray()

# Create DataFrame for encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['sex', 'smoker', 'day', 'time']))

# Combine encoded features with numerical features
X = pd.concat([tips[['total_bill', 'size']], encoded_df], axis=1)
y = tips['tip']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Print R-squared score
print(f'R-squared: {model.score(X_test, y_test)}')


In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X + np.random.randn(100, 1) * 2

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Fit Ridge Regression
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)

# Print coefficients and R-squared scores
print(f'Linear Regression - Coeff