1. Problem identification 

2. **Data wrangling**

3. Exploratory data analysis

4. Pre-processing and training data development

5. Modeling (Machine learning steps)

6. Documentation

<div class="span5 alert alert-warning">
<h3>Data Wrangling</h3>
 
**Data wrangling involves taking raw data and preparing it for processing and analysis.**


Do you think you may have the data you need to tackle the desired question?

Have you identified the required target value? (what are you predicting)

Do you have potentially useful features? (many columns)

Do you have any fundamental issues with the data? (yes)

<div class="span5 alert alert-warning">
<h3>What Interviewers Look For</h3>

- When explaining a project, describe how complex it was to obtain the data. Was it provided by someone, or did you have to extract it from somewhere?

- Show both a business understanding and a data understanding of the problem.

- Show that you understand data cleaning techniques, such as using histograms and data distributions and handling skewness and kurtosis, rather than only saying that you dropped the missing values and duplicates.

**<span style="background-color: peachpuff;">Explore all data and columns that might have what you're looking for. What do you need to clean?</span>**

**All data**
- `data.shape`
- `data.info()`
- `data.describe()`

**Column Exploration**
- `data[data.columnname == " "].T`
- `data["column"].unique()`
- `data["column"].nunique()`
- `data["column"].value_counts()`
- `(data["column_a"] + ', ' + data["column_b"]).value_counts()`
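As a rough illustration of the checks above, here is a minimal sketch on a made-up DataFrame. The column names (`city`, `state`, `price`) and all values are hypothetical, chosen only to show what each call returns.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
data = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin", " "],
    "state": ["TX", "TX", "TX", "TX"],
    "price": [100.0, 250.0, 100.0, None],
})

# shape is an attribute, so it takes no parentheses
n_rows, n_cols = data.shape

# Rows where the value is a single blank space
blank_rows = data[data["city"] == " "]

# Distinct values and how many there are
unique_cities = data["city"].unique()
n_unique = data["city"].nunique()

# Frequency of each value
city_counts = data["city"].value_counts()

# Combine two string columns to count distinct pairs
pair_counts = (data["city"] + ", " + data["state"]).value_counts()
print(pair_counts)
```

Note that the blank-space entry counts as its own distinct value, which is exactly the kind of thing this step is meant to surface.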


**<span style="background-color: peachpuff;">Deal with missing values (see "Missing Values" file for a more in-depth explanation)</span>**

- Removing specific rows/columns with excessive missing values.

- Filling missing values with mean, median, mode, or other appropriate values.

    - The choice depends on the skewness of the data.

    - Mean vs. median imputation:

        - If the distribution is roughly normal, the mean is usually fine.

        - If it is skewed, the median is more robust.

    - Mode imputation for categoricals:

        - Check the frequency distribution first so you can choose the most representative value.

- Imputing missing values using more sophisticated methods (e.g., regression, KNN).
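The simpler imputation choices above could look something like this. It is only a sketch: the data is invented, the column names `income` and `color` are hypothetical, and the 0.5 skewness cutoff is an assumed rule of thumb rather than a fixed standard.

```python
import pandas as pd

# Hypothetical data: one right-skewed numeric column, one categorical column
data = pd.DataFrame({
    "income": [30.0, 32.0, 35.0, 31.0, None, 400.0],
    "color": ["red", "blue", "red", None, "red", "blue"],
})

# Skewness guides the choice between mean and median
skew = data["income"].skew()

# Assumed cutoff of 0.5; adjust to your own data and judgment
if abs(skew) > 0.5:
    fill_value = data["income"].median()
else:
    fill_value = data["income"].mean()
data["income_filled"] = data["income"].fillna(fill_value)

# For categoricals, look at the frequency distribution before using the mode
print(data["color"].value_counts(dropna=False))
most_common = data["color"].mode().iloc[0]
data["color_filled"] = data["color"].fillna(most_common)
```

Writing the result to new columns keeps the original values available for comparison while you decide whether the imputation is reasonable.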

**<span style="background-color: peachpuff;">Methods</span>**

- `data.isnull().sum()`
- Create a new DataFrame showing the count and percentage of missing values per column:

    - ```python
      import pandas as pd

      # Count and percentage of missing values per column, largest first
      missing = pd.concat([data.isnull().sum(), 100 * data.isnull().mean()], axis=1)
      missing.columns = ['count', '%']
      missing.sort_values(by='count', ascending=False)
      ```


      



 - Do any columns have a high percentage of missing values?
 - This could be because the column is categorical and uses a number to represent something. For example, a boolean where 0 = no and 1 = yes.
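To answer the first question quickly, one option is to filter the missing-value summary by a percentage cutoff. This is a sketch on invented data; the column names and the 50% threshold are assumptions to replace with values that fit your project.

```python
import pandas as pd

# Hypothetical data with one mostly empty column
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "notes": [None, None, None, "ok"],
})

# Same count and percentage summary as in the method above
missing = pd.concat([data.isnull().sum(), 100 * data.isnull().mean()], axis=1)
missing.columns = ["count", "%"]

# Assumed cutoff in percent
threshold = 50
high_missing = missing[missing["%"] > threshold].index.tolist()
print(high_missing)
```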


**<span style="background-color: peachpuff;">Investigate Outliers</span>**

- Box plots are useful for visualizing the spread of the data and identifying outliers.
- A histogram can also show any obvious outliers: bars that are isolated from where the rest of the data is clustered. When the distribution is skewed, the tail of the distribution can also represent outliers.
- Depending on the business problem, you will need to keep or remove them, but that decision is left for 0.3 Exploratory analysis. Here we only investigate the outliers.


- **To compare multiple columns with boxplot**

    ```python 
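    import matplotlib.pyplot as plt
    import seaborn as sns

    # Replace the placeholder column names and axis labels with your own
    plt.subplots(figsize=(12, 8))
    sns.boxplot(x="column", y="column", hue="column", data=data)
    plt.xticks(rotation="vertical")
    plt.ylabel("something")
    plt.xlabel("something")
    plt.show()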
    ```
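To put numbers behind what a box plot shows, one common convention is the 1.5 × IQR rule, which matches the default whisker length in matplotlib and seaborn box plots. The sketch below uses an invented `price` column and only flags the outliers, leaving the keep-or-remove decision for later.

```python
import pandas as pd

# Hypothetical data with one value far from the rest
data = pd.DataFrame({"price": [10, 12, 11, 13, 12, 11, 95]})

# Interquartile range
q1 = data["price"].quantile(0.25)
q3 = data["price"].quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag only; nothing is dropped at this stage
data["is_outlier"] = ~data["price"].between(lower, upper)
print(data[data["is_outlier"]])
```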


**<span style="background-color: peachpuff;">Deal with Duplicates</span>**

- Verify that they are actually duplicates. If they are, drop the extras.

- `duplicates = data.duplicated()`
- `duplicates_sum = duplicates.sum()`

**Verify they are duplicates**

- `data[data['column'] == 'duplicate']`
  
**Complete duplicates vs. incomplete duplicates**
- Complete duplicates: the full row is the same as another row.
- Incomplete duplicates: only certain columns of the row are duplicated, not all.
- To return the incomplete duplicates, run the following:

    - `data[data.duplicated(subset=['columnname', 'columnname'], keep=False)]`

- To return the complete duplicates, run the following:
    - `duplicates = data[data.duplicated(keep=False)]`

**Drop them if they are duplicates** 

- `data_cleaned = data.drop_duplicates()` 
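The difference between the two kinds of duplicates is easier to see on a small example. The data and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical data: the Ann rows match fully, the Bob rows differ in "visits"
data = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Bob", "Cy"],
    "email": ["ann@example.com", "ann@example.com",
              "bob@example.com", "bob@example.com", "cy@example.com"],
    "visits": [3, 3, 5, 7, 1],
})

# Number of rows that repeat an earlier row in every column
duplicates_sum = data.duplicated().sum()

# Complete duplicates: every column matches another row
complete = data[data.duplicated(keep=False)]

# Incomplete duplicates: only the chosen subset of columns matches
incomplete = data[data.duplicated(subset=["name", "email"], keep=False)]

# With no arguments, drop_duplicates() removes only complete duplicates
data_cleaned = data.drop_duplicates()
print(data_cleaned)
```

Here both Bob rows survive `drop_duplicates()` because their `visits` values differ, so incomplete duplicates need a separate decision about which row to keep.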
  


**<span style="background-color: peachpuff;">Data Types Transformations</span>**

- Making sure all data types are correct in this step prepares the data for analysis in EDA and modeling:

    - Operations behave correctly (e.g., sorting dates chronologically).

    - Statistical functions don’t break (e.g., mean of strings = 💥).

    - Machine learning algorithms receive the right input formats.
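Real columns often contain values that don't convert cleanly, so a slightly more defensive version of these conversions may be useful. This is a sketch on invented data: `errors="coerce"` and the nullable `"Int64"` dtype are standard pandas options, but whether coercing bad values to missing is acceptable depends on your project.

```python
import pandas as pd

# Hypothetical data with an unparseable date and a missing score
df = pd.DataFrame({
    "date_column": ["2023-10-01", "2023-02-15", "not a date"],
    "score": [1.0, 2.0, None],
    "category": ["a", "b", "a"],
})

# errors="coerce" turns unparseable values into NaT instead of raising
df["date"] = pd.to_datetime(df["date_column"], errors="coerce")

# The nullable "Int64" dtype keeps missing values as <NA>,
# whereas astype(int) would raise on them
df["score"] = df["score"].astype("Int64")

# Category dtype for columns with a small set of repeated labels
df["category"] = df["category"].astype("category")

print(df.dtypes)
```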
  



```python
import pandas as pd

# Convert string to datetime
df['date'] = pd.to_datetime(df['date_column'])

# Convert float to integer (astype(int) raises if the column contains NaN)
df['score'] = df['score'].astype(int)

# Convert object to category
df['category'] = df['category'].astype('category')
```

**<span style="background-color: peachpuff;">Text Cleaning</span>**

- Cleaning up categorical values prepares data for analysis and modeling. 
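For categorical labels specifically, a common first pass is to trim whitespace and normalize case so that equivalent labels collapse into one. The column name and values below are hypothetical.

```python
import pandas as pd

# Hypothetical labels that differ only in case and surrounding whitespace
df = pd.DataFrame({"category": [" Red", "red ", "RED", "Blue", "blue"]})

# Inconsistent labels inflate the number of distinct categories
n_before = df["category"].nunique()

# Trim whitespace and lowercase so equivalent labels match
df["category_clean"] = df["category"].str.strip().str.lower()
n_after = df["category_clean"].nunique()

print(n_before, n_after)
print(df["category_clean"].value_counts())
```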


```python
# Lowercase and remove punctuation
df['clean_text'] = df['text'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

# Remove non-alphabetic characters
df['clean_text'] = df['clean_text'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Split into words
df['tokens'] = df['clean_text'].str.split()
```


