1. Problem identification 

2. Data wrangling

3. **Exploratory data analysis**

4. Pre-processing and training data development

5. Modeling (Machine learning steps)

6. Documentation

# Exploratory data analysis
- Exploratory data analysis (EDA) is an approach for summarizing and visualizing the important characteristics and statistical properties of a dataset. Visualizing the data helps you make sense of it and spot emerging trends, which in turn helps you form hypotheses about the data.

# **Data Validation** 
- Data validation ensures that the data is accurate, clean, and suitable for analysis.

**1) Update data types that need to be different**

- `data["column"] = data["column"].astype(int)`

  
Rule of thumb: when there is a date or time column, always convert it with `pd.to_datetime()` so it is not misrepresented as a string.
- `df['time'] = pd.to_datetime(df['time'])`
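
Both conversions can be sketched on a small made-up DataFrame. The column names `units` and `time` and the values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: both columns are read in as strings (object dtype)
df = pd.DataFrame({
    "units": ["3", "7", "12"],
    "time": ["2021-01-05", "2021-02-10", "2021-03-15"],
})

df["units"] = df["units"].astype(int)      # string -> integer
df["time"] = pd.to_datetime(df["time"])    # string -> datetime64

print(df.dtypes)
# Datetime columns expose date parts through the .dt accessor
print(df["time"].dt.month.tolist())
```

Once the column is a real datetime, sorting, filtering by date range, and extracting the month or year behave as expected instead of comparing text.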

**2) Validating categorical data**
- Write code that checks each value in the column you need to validate against the values you expect, and store the resulting Boolean Series in a new variable.

- ```python 
    variable = ~data["columntovalidate"].isin(["valuetofind"])
    data[variable]   
    ```                            
- `.isin()` returns True or False for each row, and `~` inverts the result, so here True means the value is not in the list of expected values. Indexing with `data[variable]` then returns only those rows.
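
A minimal sketch of this check, assuming a hypothetical `continent` column in which one value is misspelled:

```python
import pandas as pd

# Hypothetical data with one unexpected category ("Antartica" is misspelled)
data = pd.DataFrame({
    "continent": ["Africa", "Asia", "Antartica", "Europe"],
    "value": [1, 2, 3, 4],
})
valid = ["Africa", "Asia", "Europe", "Oceania"]

# True where the value is NOT one of the expected categories
not_valid = ~data["continent"].isin(valid)

print(data[not_valid])                    # rows that need attention
print(data["continent"].value_counts())   # quick overview of every category
```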

**3) Validating numerical data**
- Ensures that the numerical values are correct and reliable.
- You can validate numerical data by visually identifying the range, quartiles, and potential outliers. This is also part of dealing with outliers.

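- ```python
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Range of the column
    print(unemployment["2021"].min(), unemployment["2021"].max())
    # Quartiles and potential outliers; add y="category_column" to split by group
    sns.boxplot(data=unemployment, x="2021")
    plt.show()
    ```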
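
The same check can be done without a plot. The sketch below assumes a hypothetical `rate` column, and the plausible range of 0 to 30 is an assumption you would replace with your own domain knowledge:

```python
import pandas as pd

# Hypothetical data: 48.0 looks suspicious next to the other values
data = pd.DataFrame({"rate": [3.1, 4.5, 5.0, 6.2, 48.0]})

# count, mean, std, min, quartiles and max in one call
summary = data["rate"].describe()
print(summary)

# Flag values outside the range you consider plausible (0 to 30 is an assumption)
out_of_range = data[(data["rate"] < 0) | (data["rate"] > 30)]
print(out_of_range)
```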

**4) Data summarization**
- We can explore the characteristics of subsets of data further with the help of the .groupby() function. 

`data.groupby("column").mean()`

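- You can also use `.sum()`, `.max()`, `.min()`, etc.

**When you want to do several aggregations on columns, use `.agg()`**

`data.agg(["mean", "std"])`
- This applies every listed aggregation to each column. Recent pandas versions can raise an error on non-numeric columns, so select the numeric columns first (for example with `data.select_dtypes("number")`).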

**If you want to do different aggregations to different columns you can specify**

- ```python 
    # Each output column is named and mapped to (source column, function)
    data.groupby("column").agg(
        mean_rating = ("column", "mean"),
        std_rating = ("column", "std")
    )
    ```
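
As a small worked example of named aggregation, assuming hypothetical `genre`, `rating`, and `minutes` columns:

```python
import pandas as pd

# Hypothetical data: two genres with two films each
data = pd.DataFrame({
    "genre": ["drama", "drama", "comedy", "comedy"],
    "rating": [8.0, 6.0, 7.0, 9.0],
    "minutes": [120, 100, 90, 110],
})

# One row per genre, one column per named aggregation
summary = data.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    max_minutes=("minutes", "max"),
)
print(summary)
```

The keyword on the left becomes the name of the output column, which keeps the result readable when several aggregations are combined.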

**5) Validating categorical summaries**
- The best way to check a categorical summary is to create a bar plot.

- ```python 
    sns.barplot(data = data, x = "columncategory", y = "columnfrequency")
    plt.show()
    ```

# **Data Cleaning and Imputation**
- Dealing with Missing Values: Validate the effectiveness of data wrangling and handle any remaining issues.

- Visualize missing data patterns using heatmaps or other plots.

- Ensure that the methods used for handling missing values in the wrangling step are appropriate.

- If needed, further handle missing values based on insights gained during EDA.

**1) Addressing missing data** 

- Not dealing with missing values makes the data less representative of the population.

- Drop missing values if they amount to 5% or less of all values.

- If there are more, replace them with the mean, median, or mode (imputation).

    **Method 1** Find the threshold: a value that marks a limit and helps you decide whether to act. Here the decision is whether to drop the missing values. If a column's missing values amount to 5% or less of the rows, drop those rows; if they amount to more than 5%, find another strategy to deal with them.
    
    
    
- `data.isna().sum()` counts the missing values in each column.
- `threshold = len(data) * 0.05`
- `cols_to_drop = data.columns[data.isna().sum() <= threshold]`
  returns the columns whose number of missing values is less than or equal to the threshold.
- `data.dropna(subset =cols_to_drop, inplace = True )`
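
Method 1 can be sketched end to end on made-up data. The columns `a`, `b`, and `c` are hypothetical, and the extra `missing > 0` condition is an optional addition that skips columns with nothing to drop:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 rows, "a" has 1 missing value, "b" has 10, "c" has none
data = pd.DataFrame({
    "a": [np.nan] + list(range(39)),
    "b": [np.nan] * 10 + list(range(30)),
    "c": range(40),
})

threshold = len(data) * 0.05          # 5% of 40 rows = 2.0
missing = data.isna().sum()           # missing values per column
cols_to_drop = data.columns[(missing > 0) & (missing <= threshold)]

# Drop only the rows that are missing a value in a low-missingness column
data = data.dropna(subset=cols_to_drop)
print(data.isna().sum())              # "b" is above the threshold and still needs imputing
```

Assigning the result back to `data` does the same job as `inplace = True`.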


    **Method 2** Impute by subgroups: different subgroups might have different characteristics, so filling missing values within each subgroup with that subgroup's own statistic (for example its mean) can lead to more accurate and meaningful data.


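```python
# Mean of the column for each subgroup, stored as a dictionary
new_variable = data.groupby("colwithsubgroups")["coltocal"].mean().to_dict()

# Fill each missing value with the mean of the subgroup its row belongs to
data["coltocal"] = data["coltocal"].fillna(data["colwithsubgroups"].map(new_variable))
```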
 
- With this method it is smart to create a box plot of the columns with missing data to identify trends or outliers.
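
A worked sketch of Method 2, assuming hypothetical `airline` and `price` columns. The mean is used here to match the notes; the median is a common alternative when a subgroup contains outliers:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each airline has one missing price
data = pd.DataFrame({
    "airline": ["A", "A", "A", "B", "B", "B"],
    "price": [100.0, 120.0, np.nan, 300.0, np.nan, 340.0],
})

# Mean price per airline as a dictionary: {"A": 110.0, "B": 320.0}
price_by_airline = data.groupby("airline")["price"].mean().to_dict()

# Map every row to its airline's mean, then use it only where price is missing
data["price"] = data["price"].fillna(data["airline"].map(price_by_airline))
print(data)
```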


**2) Converting and analyzing categorical data**

- Looking for rows containing one phrase

  `data["column"].str.contains("phrase")`

- Looking for rows containing any of several phrases, using `|` (no spaces around it, because spaces become part of the pattern)

  `data["column"].str.contains("phrase|another")`

- Looking for rows that start with a specific phrase, using `^`

  `data["column"].str.contains("^phrase")`
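
The three patterns can be compared side by side on a hypothetical `job` column:

```python
import pandas as pd

# Hypothetical job titles
data = pd.DataFrame({
    "job": ["Data Scientist", "ML Engineer", "Data Analyst", "Lead Data Engineer"],
})

has_scientist = data["job"].str.contains("Scientist")         # one phrase
has_either = data["job"].str.contains("Scientist|Analyst")    # either phrase
starts_with_data = data["job"].str.contains("^Data")          # phrase at the start

# Each result is a Boolean Series that can filter the DataFrame
print(data[has_either])
print(data[starts_with_data])
```

"Lead Data Engineer" contains "Data" but does not start with it, so the `^` pattern leaves it out.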
    

**3) Working with numeric data**



1. Remove the commas from the values in the column

   `data["column"] = data["column"].str.replace(",", "")`

2. Convert the column to float

   `data["column"] = data["column"].astype(float)`

3. Create a new column by converting the currency (here with a conversion rate of 0.012)

   `data["new_col"] = data["existing_col"] * 0.012`

4. Add summary statistics to the DataFrame as a new column

   `data["std"] = data.groupby("group_col")["value_col"].transform(lambda x: x.std())`
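
The four steps can be chained on a small made-up table. The `airline` and `price` columns are hypothetical, and 0.012 is simply the rate used in these notes:

```python
import pandas as pd

# Hypothetical data: prices stored as strings with thousands separators
data = pd.DataFrame({
    "airline": ["A", "A", "B", "B"],
    "price": ["1,000", "3,000", "2,000", "6,000"],
})

# Steps 1 and 2: strip the commas, then convert to float
data["price"] = data["price"].str.replace(",", "").astype(float)

# Step 3: new column in another currency
data["price_usd"] = data["price"] * 0.012

# Step 4: per-airline standard deviation, repeated on every row of that airline
data["airline_std"] = data.groupby("airline")["price"].transform("std")
print(data)
```

Unlike `.agg()`, `.transform()` returns a result with the same length as the original DataFrame, which is why it can be assigned directly to a new column.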

**5) Dealing with Outliers**

- 75th Percentile (Q3): The value below which 75% of the data falls

`Q3 = data['values'].quantile(0.75)`
- 25th Percentile (Q1): The value below which 25% of the data falls.

`Q1 = data['values'].quantile(0.25)`

- Then calculate IQR 

`IQR = Q3-Q1` 

- Finally, calculate the lower outlier threshold

`Lower_Outlier = Q1 - (1.5 * IQR)`

- And the Upper Outlier Threshold

`Upper_Outlier = Q3 + (1.5 * IQR)`


**Removing outliers**

`outliers = data[(data["col"] < Lower_Outlier) | (data["col"] > Upper_Outlier)]`

- This selects the rows that fall outside the thresholds. Dropping them by index leaves only the rows without outliers:

`data = data.drop(outliers.index)`
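
A worked sketch of the whole IQR procedure, assuming a hypothetical `values` column with one obvious outlier:

```python
import pandas as pd

# Hypothetical data: 100 is far away from the rest
data = pd.DataFrame({"values": [10, 12, 11, 13, 12, 11, 100]})

Q1 = data["values"].quantile(0.25)
Q3 = data["values"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Rows outside the thresholds, and the data with those rows removed
outliers = data[(data["values"] < lower) | (data["values"] > upper)]
no_outliers = data[(data["values"] >= lower) & (data["values"] <= upper)]

print(outliers)
print(no_outliers)
```

Filtering with the inverse condition gives the same rows as dropping `outliers.index`. Whether to remove an outlier at all depends on whether it is an error or a genuine extreme value.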

**6) Creating a PCA - principal component analysis**
- PCA is a statistical technique used for dimensionality reduction.
- It simplifies complex datasets by transforming the original features into a new set of features called principal components. These new features are uncorrelated with each other and ordered by the amount of variance they explain in the original data, so the first components carry the most information.

1. Make a new DataFrame that only has the numeric columns. The index should be a label you want to keep.
   - Save the column names in a new variable.

   `original_columns = data.columns`

2. Scale the data with `scale()` from `sklearn.preprocessing`. This returns an array, not a DataFrame.
   - Standardizing data means transforming your features so they have a mean of 0 and a standard deviation of 1. This is also known as z-score normalization.

   `scaled_array = scale(data)`

3. Make a DataFrame from the array.

   `dataframe_from_array = pd.DataFrame(scaled_array, columns = original_columns)`

4. Check the mean of the new standardized DataFrame.

   `dataframe_from_array.mean()`

   - Checking the mean of the scaled features verifies that the standardization was done correctly: every mean should be approximately 0. The output will be in scientific notation.

5. Check the standard deviation of the new standardized DataFrame. Make sure to use the `ddof=0` parameter, because `scale()` uses the population standard deviation while pandas defaults to `ddof=1`.

   `dataframe_from_array.std(ddof=0)`
   
6. Calculate the PCA transformation.

   - The PCA model analyzes the scaled data (`dataframe_from_array`) and learns the principal components, that is, the directions that capture the most variance in your data. This step doesn't change or transform your data; it just learns from it.

   `pca = PCA().fit(dataframe_from_array)`


7. Plot the cumulative variance ratio against the number of components.

   - `explained_variance_ratio_` is an attribute of the fitted `PCA` object in scikit-learn. When you fit the PCA model to your data, it calculates the fraction of the total variance explained by each principal component and stores it in this attribute.
   
   ```python
   plt.subplots(figsize=(10, 6))
   plt.plot(pca.explained_variance_ratio_.cumsum())
   plt.xlabel('label')
   plt.ylabel('label')
   plt.title("title");
   ```

8. Apply the transformation to the data to obtain the derived features.

   - The `transform()` method uses the learned principal components to project your scaled data into the new principal component space. This creates a new representation of your data in terms of the principal components.

   `transform_with_pca = pca.transform(dataframe_from_array)`
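
The eight steps can be sketched together as follows. The data is randomly generated purely for illustration (columns `a`, `b`, `c` and the `row_i` index labels are hypothetical), and the plot is left out so the sketch runs without a display:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical numeric data: "b" is strongly related to "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=50)
data = pd.DataFrame(
    {"a": a, "b": 2 * a + rng.normal(scale=0.1, size=50), "c": rng.normal(size=50)},
    index=[f"row_{i}" for i in range(50)],
)

# Steps 1 to 3: keep the column names, scale, rebuild a DataFrame
original_columns = data.columns
scaled_array = scale(data)                      # NumPy array, not a DataFrame
scaled_df = pd.DataFrame(scaled_array, columns=original_columns, index=data.index)

# Steps 4 and 5: means close to 0, standard deviations of 1 with ddof=0
print(scaled_df.mean())
print(scaled_df.std(ddof=0))

# Steps 6 to 8: fit, inspect the cumulative explained variance, transform
pca = PCA().fit(scaled_df)
cumulative = pca.explained_variance_ratio_.cumsum()
components = pca.transform(scaled_df)           # one row per observation, one column per component
print(cumulative)
```

Because `a` and `b` carry almost the same information, the first component alone explains most of the variance, which is the kind of pattern the cumulative plot is meant to reveal.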

   