------------
    CRISP-DM Framework
-----------------

CRISP-DM, which stands for **Cross-Industry Standard Process for Data Mining**, is an industry-proven way to guide your data mining efforts.

<img src="CRISP-DM.png" width = "450">

CRISP-DM is a widely used methodology for guiding data mining and analytics projects. It consists of six main phases:

1. **Business Understanding**: In this phase, the project objectives and requirements are understood from a business perspective. This involves understanding the problem, defining goals, and determining what success looks like.

2. **Data Understanding**: Here, data collection and exploration take place. This involves collecting the initial data, describing and exploring it, and verifying its quality and its relevance to the problem at hand. Quality problems found here (missing values, inconsistent formats, outliers) are recorded so they can be fixed in the next phase.

3. **Data Preparation**: Once the data is understood, it needs to be prepared for modeling. This involves selecting appropriate datasets, cleaning the data, handling missing values, transforming variables, and other data preprocessing tasks.

4. **Modeling**: In this phase, various modeling techniques are applied to the prepared data. This could involve techniques such as machine learning algorithms, statistical models, or other analytic methods depending on the nature of the problem.

5. **Evaluation**: Once models are built, they need to be evaluated to determine how well they solve the problem. This involves assessing model performance against the success criteria defined during Business Understanding, comparing different models, and selecting the best-performing one(s) for deployment.

6. **Deployment**: In the final phase, the insights gained from the data are deployed into the business operations. This could involve integrating the model into existing systems, creating reports or visualizations for stakeholders, and monitoring the model's performance over time.

CRISP-DM is iterative, meaning that it's common for steps to be revisited or repeated as the project progresses and new insights are gained. It provides a structured approach to data mining projects, helping to ensure that all relevant aspects are considered and that the project stays focused on delivering value to the business.

In short

CRISP-DM is a structured, iterative methodology for data mining projects, encompassing six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It guides a project from problem understanding through data exploration and model building to deployment, keeping the work aligned with business objectives and maximizing the value obtained from data.

-------------------------------
    Data Preparation and Feature Engineering
---------------------------------

**Null/Missing value treatment**

Null or missing value treatment involves handling instances where data points are absent in a dataset. Common techniques include:

1. **Deletion**: Removing rows or columns with missing values. This is simple but can lead to loss of valuable information.

2. **Imputation**: Filling missing values with substitutes. This could involve using statistical measures like mean, median, or mode, or more complex methods such as K-nearest neighbors (KNN) imputation or predictive modeling.

3. **Forward Fill/Backward Fill**: Propagating non-null values forward or backward in time series data to fill null values.

4. **Interpolation**: Estimating missing values based on existing data points using interpolation techniques like linear or polynomial interpolation.

5. **Predictive Modeling**: Building a model to predict missing values based on other variables in the dataset.

6. **Multiple Imputation**: Creating several imputed versions of the dataset, analyzing each one, and pooling the results, so that the uncertainty introduced by the imputation is reflected in the final estimates.

The choice of method depends on factors such as the nature of the data, the extent of missingness, and the analysis objectives. Each technique has its advantages and limitations, and it's crucial to assess which approach best suits the specific dataset and analytical goals.
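
As a rough illustration of the simpler techniques above (deletion, statistical imputation, forward/backward fill, and interpolation), here is a minimal pandas sketch. The toy DataFrame and its column names (`age`, `city`, `temp`) are made up for this example and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "temp": [20.0, np.nan, np.nan, 26.0, 28.0],
})

# 1. Deletion: drop every row that has at least one missing value.
dropped = df.dropna()

# 2. Imputation with simple statistics (mean, median, mode).
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())
city_mode = df["city"].fillna(df["city"].mode()[0])

# 3. Forward fill / backward fill: only meaningful when rows are ordered,
#    for example in a time series.
temp_ffill = df["temp"].ffill()
temp_bfill = df["temp"].bfill()

# 4. Linear interpolation between the nearest known values.
temp_interp = df["temp"].interpolate(method="linear")

print(dropped)
print(temp_interp.tolist())
```

Note how deletion keeps only two of the five rows here, which shows why it can be costly on small datasets.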

<img src ="Missingdata.jpg" width="450">

There are three main types of missing values:

**Missing Completely at Random (MCAR)**: MCAR is a specific type of missing data in which the probability of a data point being missing is entirely random and independent of any other variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the characteristics of the data point itself.

**Missing at Random (MAR)**: MAR is a type of missing data where the probability of a data point missing depends on the values of other variables in the dataset, but not on the missing variable itself. This means that the missingness mechanism is not entirely random, but it can be predicted based on the available information.

**Missing Not at Random (MNAR)**: MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. This means that the reason for the missing data is informative and directly associated with the variable that is missing.

Here are examples of each type of missing data mechanism:

**MCAR (Missing Completely At Random):**
Example: A survey where some respondents accidentally skip questions due to a technical glitch in the survey software. The likelihood of missing responses is unrelated to any characteristic of the respondents or their answers.
In this case, missingness is independent of any observed or unobserved variables, making it MCAR.

**MAR (Missing At Random):**
Example: In a medical study, older participants are more likely than younger ones to skip the question about their daily cigarette consumption. The likelihood of a missing answer is related to age (an observed variable), but among participants of the same age it is not related to the actual consumption (the missing data).
Here, the missingness is related to an observed variable (age) but not directly to the missing data (cigarette consumption), making it MAR.

**MNAR (Missing Not At Random):**
Example: In a study on income, individuals with higher incomes are less likely to disclose their exact earnings. The likelihood of withholding income information is directly related to the income level itself.
In this scenario, the missingness depends on the value of the missing variable (income), making it MNAR.
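
The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers: the synthetic `age` and `income` variables, the 30%/50%/10% missingness rates, and the age cut-off of 40 are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income rises with age, plus random noise.
age = rng.uniform(20, 60, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends only on an observed variable (age).
mar_mask = rng.random(n) < np.where(age > 40, 0.50, 0.10)

# MNAR: missingness depends on the missing value itself (income).
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.50, 0.10)

# Compare the mean of the observed incomes with the true mean.
true_mean = income.mean()
bias = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    bias[name] = income[~mask].mean() - true_mean
    print(f"{name}: observed mean minus true mean = {bias[name]:,.0f}")

# Under MAR the bias disappears once we condition on the observed variable:
# within the younger group, observed and true means agree closely.
young = age <= 40
mar_young_bias = income[young & ~mar_mask].mean() - income[young].mean()
mnar_young_bias = income[young & ~mnar_mask].mean() - income[young].mean()
print(f"MAR within age <= 40:  {mar_young_bias:,.0f}")
print(f"MNAR within age <= 40: {mnar_young_bias:,.0f}")
```

In this setup the raw observed mean is close to the truth only under MCAR. Under MAR the distortion can be corrected by conditioning on age, whereas under MNAR it persists even within an age group, which is why MNAR is the hardest case to handle.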

    df.isnull().sum()  # count the missing values in each column of the pandas DataFrame df

    How to deal with missing values:
    1. Never do missing value treatment without consulting the business team.
    If no business team is available:
    2. If less than 1% of the values are missing and the dataset is large, you can drop the rows with missing values.
    3. If the dataset is small, or you do not want to drop rows (whatever the dataset size), impute the missing values.
        Imputation:
           If the column is numerical and outlier treatment is done, replace missing values with the mean.
           If the column is numerical and outlier treatment is not done, replace missing values with the median.
           If the column is categorical, replace missing values with the mode.
    4. If more than 30% of the values in a column are missing, drop the column, in consultation with the business team where possible.
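
The rules of thumb above could be sketched as a small helper. This is only one possible reading of them: the function name `treat_missing`, its parameters, and in particular the `min_rows_to_drop` threshold used to decide what counts as a "large" dataset are assumptions made for illustration, and the example data is invented.

```python
import numpy as np
import pandas as pd


def treat_missing(df, outliers_treated=False, drop_row_below=0.01,
                  drop_col_above=0.30, min_rows_to_drop=1_000):
    """Apply the rule-of-thumb missing value treatment to a copy of df.

    min_rows_to_drop is a hypothetical cut-off for a "large" dataset.
    """
    out = df.copy()

    # Rule 4: drop columns where more than 30% of the values are missing.
    missing_share = out.isnull().mean()
    out = out.drop(columns=missing_share[missing_share > drop_col_above].index)

    for col in out.columns:
        share = out[col].isnull().mean()
        if share == 0:
            continue
        if share < drop_row_below and len(out) >= min_rows_to_drop:
            # Rule 2: few missing values in a large dataset, so drop the rows.
            out = out.dropna(subset=[col])
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Rule 3 (numerical): mean if outliers were treated, else median.
            fill = out[col].mean() if outliers_treated else out[col].median()
            out[col] = out[col].fillna(fill)
        else:
            # Rule 3 (categorical): most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Invented example data: "notes" is 40% missing, the others 10% missing.
raw = pd.DataFrame({
    "salary": [10, 12, np.nan, 11, 500, 13, 12, 11, 10, 12],
    "dept": ["a", "b", "a", None, "a", "b", "a", "a", "b", "a"],
    "notes": [np.nan] * 4 + ["x"] * 6,
})
clean = treat_missing(raw)
print(clean)
```

Because the `salary` column still contains an untreated outlier (500), the median is used for it by default. Passing `outliers_treated=True` would switch to the mean. The helper does not replace rule 1: the thresholds should still be agreed with the business team whenever one is available.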