**Methodologies of Data Science Projects**
1. CRISP-DM :- Cross-Industry Standard Process for Data Mining.
  1. Business Understanding:-
      - Who are the stakeholders in this project? Who will be directly affected by the creation of this project?
     - What business problem(s) will this Data Science project solve for the organization?
    - What problems are inside the scope of this project?
    - What problems are outside the scope of this project?
    - What data sources are available to us?
    - What is the expected timeline for this project? Are there hard deadlines (e.g. "must be live before holiday season shopping") or is this an ongoing project?
   - Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't? 

 2. Data Understanding :-
   - What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?
   - Who controls the data sources, and what steps are needed to get access to the data?
   - What is our target?
   - What predictors are available to us?
   - What data types are the predictors we'll be working with?
   - What is the distribution of our data?
   - How many observations does our dataset contain? Do we have a lot of data? Only a little?
   - Do we have enough data to build a model? Will we need to use resampling methods?
   - How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong? 

 3. Data Preparation:
 - Detecting and dealing with missing values
 - Data type conversions (e.g. numeric data mistakenly encoded as strings)
 - Checking for and removing multicollinearity (correlated predictors)
 - Normalizing our numeric data
 - Converting categorical data to numeric format through one-hot encoding
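
The Data Preparation steps above can be sketched with pandas and scikit-learn. This is a minimal illustration on a made-up DataFrame: the column names, values, and the choice of median imputation are assumptions for the example, not part of the original notes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "income" is numeric but stored as strings,
# "age" has a missing value, and "city" is categorical.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0],
    "income": ["40000", "52000", "61000", "58000", "75000"],
    "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Mombasa"],
})

# 1. Detect missing values, then fill them (median imputation is one simple option)
missing_counts = df.isna().sum()
df["age"] = df["age"].fillna(df["age"].median())

# 2. Convert numeric data that was mistakenly encoded as strings
df["income"] = pd.to_numeric(df["income"])

# 3. Check for multicollinearity by looking at pairwise correlations
corr = df[["age", "income"]].corr()

# 4. Normalize the numeric columns (zero mean, unit variance)
numeric_cols = ["age", "income"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# 5. One-hot encode the categorical column (drop_first avoids a redundant column)
prepared = pd.get_dummies(df, columns=["city"], drop_first=True)
print(prepared.head())
```

In a real project the scaler would normally be fit on the training data only and then applied to the test data, so that no information from the test set leaks into preparation.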

 4. Modeling -:
  - Is this a classification task? A regression task? Something else?
  - What models will we try?
  - How do we deal with overfitting?
  - Do we need to use regularization or not?
  - What sort of validation strategy will we be using to check that our model works well on unseen data?
  - What loss functions will we use?
  - What threshold of performance do we consider as successful?

5. Evaluation.
6. Deployment.

  

2. KDD - Knowledge Discovery in Databases.
   1. Selection - roughly corresponds to Business Understanding in CRISP-DM.
   - The output of this stage is the dataset you'll be using for the Data Science project.
   2. Preprocessing - The output of this stage is preprocessed data that is more "clean" than it was at the start of this stage -- although the dataset is not quite ready for modeling yet.
   3. Transformation - The output of this stage is a dataset that is now ready for modeling.
   4. Data Mining - refers to using different modeling techniques to try and build a model that solves the problem we're after -- often, this is a classification or regression task. During this stage, you'll also define your parameters for given models, as well as your overall criteria for measuring the performance of a model.
   5. Evaluation

3. OSEMN 
   1. Obtain - data
   2. Scrub - filter the data
   3. Explore - the data/visualizations
   4. Model - modeling. Regression/ Classification
   5. Interpret results

**Modeling Approaches**
1. For Inference-: modeling data to draw conclusions.
- What is the relationship between X and y?
- How does X affect y?
2. For predictions -: modeling data to make/draw predictions from it.
- It is important for the model to generalize to unseen data.
* Model Generalization - how well a model trained on some data performs when it is given new data to make predictions.
- As model complexity increases, the gap between the training error and the prediction error on new data (known as "optimism") grows, which means the model is getting worse at generalizing.
* Model Validation -:
- Model validation is the process of measuring how well a model generalizes. A large gap between training and validation performance indicates overfitting.

- Here is how we perform validation, in its simplest form:

 1. Split the data into two parts with a 70/30, 80/20, or a similar split
 2. Use the larger part for training so the model learns from it
 3. Use the smaller part for testing the model

 - This is called a train-test split and means that you can compare the model performance on training data vs. testing data using a given metric. The metric can be R-Squared or it can be an error-based metric like RMSE.

 - Or **CROSS VALIDATION**: splitting the data multiple times and training multiple models, to get a distribution of metric values rather than relying on the metric from a single train-test split.

## Machine Learning Fundamentals
- Machine learning involves building models of the relationship between independent and dependent variables, with an emphasis on prediction.
1. Model Validation - (assess how well the model will perform on unseen data)
- Involves the use of validation techniques, e.g. train_test_split, cross_val_score and cross_validate, all from sklearn.model_selection.

2. Bias -: in ML this is the systematic difference between the model's average prediction and the true value. It is mainly caused by simplifying assumptions in the model that make the target function easier to learn.
- Error due to overly simplistic assumptions in the model (underfitting). A high-bias model has high error on both the training data and new data.

* Variance -: in statistics, variance describes how much a random variable differs from its expected value. In ML it cannot be judged from a single training set: it measures how much the model's predictions change when the model is trained on different training sets.
- Error due to the model being too sensitive to small fluctuations in the training data (overfitting).
* **Bias-variance tradeoff** -:
- Involves striking a balance between bias and variance (a balance between underfitting and overfitting).
- The goal is the lowest possible bias while keeping the model's performance generalizable to unseen data, to **REDUCE ERRORS**.

- **Underfitting** -: a model with low variance and high bias will underfit the target. The model is too simple, so it fails to capture the underlying patterns in the data.

- e.g. a linear regression model used to fit non-linear data would have high bias because it assumes a linear relationship between the features and the target variable, which oversimplifies the true relationship.

- **Overfitting** -: a model with high variance and little bias will overfit the target. The model is too complex, so it fits noise in the training data as well as the underlying pattern.
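
Underfitting and overfitting can be seen by comparing training and test error as model complexity grows. The sketch below assumes synthetic data generated from a sine curve plus noise and uses polynomial degree as the complexity knob. The data, the degrees tried, and the 50/50 split are all choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y = sin(x) + noise (an assumption for illustration)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Degree 1 is too simple, degree 4 is a reasonable fit, degree 15 is very flexible
results = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

One would typically expect the degree-1 model to show a high error on both sets (underfitting), and the most flexible model to show a low training error with a larger test error (overfitting). The exact numbers depend on the random seed.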

3. Regularization -:
- Used to help avoid overfitting (**REDUCE VARIANCE**).
- Ridge and Lasso regression are extensions to linear regression with penalty terms that help prevent overfitting.
- Other ways to reduce variance: reduce model complexity or gather more training data.

 - To reduce bias: use a more complex model that can capture more patterns in the data (e.g., adding polynomial features to a linear regression model).
 

In [None]:
#to reduce bias you make the model more complex
#using Polynomial Regression
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Fitting Polynomial Regression to the dataset
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue') 
plt.title('Truth or Bluff (Linear Regression)') 
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

# Visualising the Polynomial Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(X_poly), color = 'blue')
plt.title('Truth or Bluff (Polynomial Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()


## Using Cross Validation to estimate the error at different model complexities

In [None]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=0)

# Define polynomial degrees to test
degrees = [1, 2, 3, 4, 5]

# Store cross-validation results
cv_scores = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Use scoring='neg_root_mean_squared_error' to get RMSE instead of MSE
    scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
    cv_scores.append(-np.mean(scores))  # Convert to positive MSE

# Plot results
plt.plot(degrees, cv_scores, marker='o')
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff')
plt.show()


In [None]:
# Alternative: compare a baseline model with a polynomial model using cross-validated RMSE
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Assumed setup (not defined in the original cell): sample data, a training split and a baseline model
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
baseline_model = LinearRegression()

# Get the cross validated scores for our baseline model
baseline_cv = cross_val_score(baseline_model, X_train, y_train, scoring="neg_root_mean_squared_error")

# Display the average of the cross-validated scores (negated to get a positive RMSE)
baseline_cv_rmse = -(baseline_cv.mean())
print(baseline_cv_rmse)

# After adding polynomial features
# Instantiate polynomial features transformer
poly = PolynomialFeatures(2)

# Fit transformer on entire X_train
poly.fit(X_train)

# Create transformed data matrix by transforming X_train
X_train_poly = poly.transform(X_train)

# Get the cross validated scores for our transformed features
# (cross_val_score fits a fresh copy of the model on each fold)
poly_cv = cross_val_score(baseline_model, X_train_poly, y_train, scoring="neg_root_mean_squared_error")

# Display the average of the cross-validated scores
poly_cv_rmse = -(poly_cv.mean())
print(poly_cv_rmse)

## Using train_test_split to validate the model (estimate the error on unseen data)

In [None]:
# Use train_test_split to split the data into training and testing data
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures  # used in the next cell

# Generate sample data
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=0)

# Split into 80% training data and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Build and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the testing data
y_pred = model.predict(X_test)

# Get the error: RMSE is the square root of MSE
# (the squared=False argument is deprecated in recent scikit-learn versions)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse





In [None]:
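# After adding polynomial features
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Instantiate polynomial features transformer
poly = PolynomialFeatures(2)

# Fit transformer on X_train only
poly.fit(X_train)

# Transform both the training and the testing data with the fitted transformer
X_train_poly = poly.transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit the model on the transformed training data
model.fit(X_train_poly, y_train)

# Get the training error
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train_poly)))

# Get the testing error, which is comparable to the test RMSE of the plain linear model
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test_poly)))
rmse_train, rmse_test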



## Bias

In [None]:
# Calculating bias
import numpy as np

# A function that estimates the bias of a model from one set of predictions
def bias(y, y_pred):
    """Mean signed error: how far predictions over- or under-shoot on average."""
    # Difference between the mean prediction and the mean actual value
    # (equivalent to np.mean(y_pred - y))
    bias_value = np.mean(y_pred) - np.mean(y)

    return bias_value
    

## Variance

In [None]:
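# Variance of the predictions within one set
# Note: this measures the spread of a single set of predictions. The variance in the
# bias-variance tradeoff compares predictions of models trained on different training sets.
import numpy as np

def variance(y_pred):
    mean_y_pred = np.mean(y_pred)
    variance_value = np.mean((y_pred - mean_y_pred) ** 2)
    return variance_value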
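
Because variance in the bias-variance sense is about different training sets, one way to estimate it is to refit the same model on many simulated training sets and compare the predictions. The sketch below is a rough illustration under stated assumptions: the true function (`np.sin`), the noise level, the fixed input grid, and the helper name `estimate_bias_variance` are all made up for the example. It only works because the data is simulated and the true function is known.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Fixed inputs and a known true function (only available because the data is simulated)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
f_true = np.sin(X).ravel()

def estimate_bias_variance(degree, n_repeats=200, noise=0.3):
    """Refit a polynomial model on many noisy training sets and
    return (squared bias, variance) averaged over the inputs."""
    preds = np.empty((n_repeats, len(X)))
    for i in range(n_repeats):
        # Each repetition draws a new noisy training target for the same inputs
        y_train = f_true + rng.normal(scale=noise, size=len(X))
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(X, y_train).predict(X)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)  # average prediction vs. truth
    var = np.mean(preds.var(axis=0))              # spread across training sets
    return bias_sq, var

bias_sq_1, var_1 = estimate_bias_variance(degree=1)
bias_sq_6, var_6 = estimate_bias_variance(degree=6)
print(f"degree 1: bias^2={bias_sq_1:.4f}, variance={var_1:.4f}")
print(f"degree 6: bias^2={bias_sq_6:.4f}, variance={var_6:.4f}")
```

The simpler model should show the larger squared bias and the more flexible model the larger variance, which is the tradeoff described earlier.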

**Ridge and Lasso**
1. **Ridge** - (L2 Regularization): Prevents overfitting by shrinking the coefficients toward zero but doesn't set them exactly to zero.
-  It’s useful when you believe that all the features have some impact on the target variable, but you want to reduce the magnitude of the coefficients to prevent overfitting.
- Bias-Variance Tradeoff: Ridge regression reduces variance at the cost of introducing some bias. It makes the model less complex, thus reducing variance, but it introduces bias by not allowing the coefficients to fit the data as well as they could without the penalty.

- Use Case: Ridge regression is often **used when you have many features that are somewhat correlated**, and you want to avoid overfitting while still keeping all features in the model.

2. **Lasso** (L1 regularization):
-  Lasso regression shrinks the coefficients toward zero and can also set some of them exactly to zero.
- **used for feature selection**, as it effectively removes features that are not strongly associated with the target variable.
- Bias-Variance Tradeoff: Lasso reduces variance by shrinking the coefficients. However, because Lasso can set some coefficients to zero, it can increase bias by excluding some variables entirely from the model.

- Use Case: Lasso is **useful when you have a large number of features, and you suspect that only a small subset of them are actually important for predicting the target variable.** It helps in feature selection by shrinking insignificant features' coefficients to zero.
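
The two penalties described above can be written as follows. The notation is an assumption for illustration: $\beta_j$ are the $p$ model coefficients, $\hat{y}_i$ is the prediction for observation $i$, and $\lambda \ge 0$ controls the penalty strength (scikit-learn calls this parameter `alpha`, and its exact scaling of the two terms differs slightly).

$$
\text{Ridge (L2):}\quad \min_{\beta}\; \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

$$
\text{Lasso (L1):}\quad \min_{\beta}\; \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|
$$

With $\lambda = 0$ both reduce to ordinary least squares. The absolute-value penalty is what allows Lasso to set coefficients exactly to zero, while the squared penalty only shrinks them.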

Choosing Between Ridge and Lasso:
 * Use Ridge when:
   - You have many correlated features.
   - You believe most features should be retained in the model but need to control their influence.
 * Use Lasso when:
   - You suspect that only a few features are relevant.
   - You want a sparse model where some coefficients are exactly zero, effectively performing feature selection.

* Combination (Elastic Net):

    Elastic Net combines both Ridge and Lasso penalties. It’s useful when you want to balance the benefits of both methods, especially when you have many correlated features and you still want to perform feature selection.
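
The difference between the three penalties can be checked on synthetic data. This is a minimal sketch: the dataset (20 features, 5 of them informative) and the `alpha` and `l1_ratio` values are arbitrary choices for illustration, and in practice they would be tuned with cross validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which carry signal (an assumption for illustration)
X, y = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=10.0, random_state=0
)
# The penalties depend on the scale of the features, so standardize first
X = StandardScaler().fit_transform(X)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=10.0),
    "Lasso": Lasso(alpha=5.0),
    "ElasticNet": ElasticNet(alpha=5.0, l1_ratio=0.5),
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    n_zero = int(np.sum(model.coef_ == 0))
    size = np.linalg.norm(model.coef_)
    print(f"{name:10s}  zero coefficients: {n_zero:2d}  coefficient norm: {size:8.2f}")
```

Ridge should shrink the coefficient norm relative to plain linear regression without producing exact zeros, while Lasso should zero out a number of coefficients.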

## Cross Validation

In [None]:
# Import cross validation tools
# Assumes X and y are already defined (e.g. from make_regression in an earlier cell)
# 1. cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linreg = LinearRegression()

# By default a regressor is scored with R^2 and cv=5 (5-fold cross validation)
cross_val_score(linreg, X, y)

# 10 folds, scored with negative mean squared error
cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')

# 2. Using a custom scorer built with make_scorer
from sklearn.metrics import make_scorer

# greater_is_better=False means lower values are better, so the scorer returns negated MSE
scorer = make_scorer(mean_squared_error, greater_is_better=False)

cross_val_score(linreg, X, y, cv=10, scoring=scorer)
# OR
cross_val_score(linreg, X, y, scoring=make_scorer(mean_squared_error, greater_is_better=False))

# 3. Using cross_validate
from sklearn.model_selection import cross_validate

# Returns a dictionary with the test scores plus the fit and score times
cross_validate(linreg, X, y, cv=10)

# Get both R^2 and negative mean squared error
cross_validate(linreg, X, y, scoring=["r2", "neg_mean_squared_error"])

# To compare the train vs. test scores (e.g. to look for overfitting):
cross_validate(linreg, X, y, return_train_score=True)

### displaying the cross validation scores

In [None]:
# Displaying the mean of the scores
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate

# Assumed plotting setup (labels and colors were not defined in the original cell)
labels = ["Train", "Test"]
colors = ["tab:blue", "tab:orange"]

cross_val_results = cross_validate(linreg, X, y, scoring="neg_mean_squared_error", return_train_score=True)
# Negative signs in front to convert back to MSE from -MSE
train_avg = -cross_val_results["train_score"].mean()
test_avg = -cross_val_results["test_score"].mean()

fig, ax = plt.subplots()
ax.bar(labels, [train_avg, test_avg], color=colors)
ax.set_ylabel("MSE")
fig.suptitle("Average Cross-Validation Scores")

# Or, to look at the distribution of scores, plot histograms
# (cv=100 needs at least 100 observations; each fold is then very small)
cross_val_results = cross_validate(linreg, X, y, cv=100, scoring="neg_mean_squared_error", return_train_score=True)
train_scores = -cross_val_results["train_score"]
test_scores = -cross_val_results["test_score"]

fig, (left, right) = plt.subplots(ncols=2, figsize=(10,5), sharey=True)
bins = 25
left.hist(train_scores, label=labels[0], bins=bins, color=colors[0])
left.set_ylabel("Count")
left.set_xlabel("MSE")
right.hist(test_scores, label=labels[1], bins=bins, color=colors[1])
right.set_xlabel("MSE")
fig.suptitle("Cross-Validation Score Distribution")
fig.legend();

## When to use Cross Validation and when to use Train_test_split for model validation
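- Cross validation is often preferred over a single train_test_split because:
1. it reduces the risk that the performance estimate depends on one lucky or unlucky split, since the score is averaged over several splits.
2. it is better with a small dataset, as it allows each data point to be used for both training and testing.
- Train_test_split is commonly used with a large dataset, where a single split is already representative and fitting the model several times would be expensive.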
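
The two approaches can also be combined: hold out a test set once, cross-validate on the training portion, and use the test set only for a final check. The sketch below shows one common arrangement rather than a prescribed workflow. The synthetic data and the 80/20 split are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out a final test set that is not used while comparing models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()

# 5-fold cross validation on the training portion only (R^2 per fold)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV R^2 per fold:", cv_scores.round(3))
print("CV R^2 mean:", cv_scores.mean().round(3))

# Final check on the held-out test set
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Test R^2:", round(test_score, 3))
```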


