# 0. Imports

### 0.1 Import our Automation Code

In [1]:
import automation as auto

### 0.2 Import the Dataset

The Movies Revenue dataset contains 34 features related to movies, including `budget`, `title`, `runtime`, `crew`, `cast`, and more, with the goal of predicting movie `revenue`.  
- **Dataset dimensions**: (5368, 34)  
- **Target attribute**: `revenue`

In [27]:
df_train = auto.load_data('./data/dataset_movies_train.csv')
df_test = auto.load_data('./data/dataset_movies_test.csv')
org_train = df_train.copy()
org_test = df_test.copy()
target = 'revenue'

### 0.3 Artificially Introduce Nulls to the Data
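Since our dataset has no missing values, we artificially remove some values from a few attributes for the sake of demonstration. Each entry in `percentages` is the share of values (in percent) removed from the attribute at the same position in `attributes`.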

In [28]:
attributes = ['budget', 'original_language', 'vote_count', 'popularity']
percentages = [27, 32, 27, 25]
df_train, df_test, missing_train, missing_test = auto.create_mcar(df_train, df_test, attributes, percentages)
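
The internals of `create_mcar` are not shown here. As a rough illustration of what removing values completely at random (MCAR) can look like, here is a minimal pandas sketch. The helper name `mask_mcar`, its parameters, and the fixed seed are assumptions made for this example and are not part of the `automation` module.

```python
import numpy as np
import pandas as pd


def mask_mcar(df, attributes, percentages, seed=0):
    """Return a copy of df with about the given percentage of each attribute set to NaN.

    Hypothetical helper: rows are picked uniformly at random, independently of
    any column values, which is what makes the missingness MCAR.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for attr, pct in zip(attributes, percentages):
        n_missing = int(round(len(out) * pct / 100))
        # Choose distinct row positions to blank out for this attribute
        positions = rng.choice(len(out), size=n_missing, replace=False)
        out.loc[out.index[positions], attr] = np.nan
    return out


toy = pd.DataFrame({
    "budget": np.arange(100, dtype=float),
    "original_language": ["en"] * 100,
})
masked = mask_mcar(toy, ["budget", "original_language"], [27, 32])
print(masked.isna().sum())
```

Because each attribute is masked independently, a row can end up with several missing attributes or none.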

# 1. Explore the Missing Values
**Method**: `show_nulls`

**Input**:
- **df_train** → The training dataset  
- **df_test** → The test dataset  

**Description**:
The `show_nulls` method provides an overview of missing values in the given datasets. It identifies attributes with missing values and displays:  
- The count of missing values per attribute  
- The percentage of missing values relative to the number of rows in that dataset  

This helps in understanding the extent of missing data before applying imputation techniques.

In [29]:
auto.show_nulls(df_train, df_test)

Train data:
                   Null Count Percentage
original_language        1374    31.998%
budget                   1159    26.991%
popularity               1073    24.988%
vote_count               1159    26.991%

Test data:
                   Null Count Percentage
original_language         343    31.937%
budget                    289    26.909%
popularity                268    24.953%
vote_count                289    26.909%
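
For readers who want to reproduce this kind of overview without the `automation` module, the same summary can be sketched directly in pandas. The function name `null_summary` and the toy data are assumptions for illustration; the real `show_nulls` may compute or format its output differently.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    counts = counts[counts > 0]  # keep only attributes with missing values
    summary = pd.DataFrame({
        "Null Count": counts,
        # Percentage is relative to the number of rows in this dataframe
        "Percentage": (counts / len(df) * 100).round(3),
    })
    return summary.sort_values("Null Count", ascending=False)


toy = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, np.nan],
    "runtime": [90, 100, 110, 120],
    "popularity": [np.nan, 2.0, 3.0, 4.0],
})
print(null_summary(toy))
```

Calling the function once per dataframe gives separate train and test summaries like the ones shown above.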


# 2. Imputing Attributes and Evaluating the Scores

### 2.1 Imputing an Attribute

**Method**: `try_all_methods`

**Input**:
- **df_train** → The training dataset  
- **df_test** → The test dataset  
- **attribute** → The attribute to be imputed  
- **target** → The name of the target attribute  
- **is_categorical** → A boolean indicating whether the attribute to be imputed is categorical or numerical  

**Output**:
- **df_array** → A list of dataset pairs, one per imputation method, each holding the train and test data produced by that method.  

**Description**:
The `try_all_methods` function applies various imputation techniques to the selected attribute and returns the resulting datasets. This allows for a comprehensive comparison of different imputation strategies.


In [30]:
# Example of imputing the attribute budget
attribute = 'budget'
df_arr = auto.try_all_methods(df_train, df_test, attribute, target, is_categorical=False)
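
To make the idea of "one train/test pair per method" concrete, the sketch below applies three of the simpler strategies (mean, median, most frequent value) to a single numerical attribute. The helper `simple_imputations` and its return layout are assumptions for illustration only; `try_all_methods` covers more methods and may organise its results differently.

```python
import numpy as np
import pandas as pd


def simple_imputations(train, test, attribute):
    """Impute one numerical attribute with several simple strategies.

    Hypothetical helper. Fill values are computed from the training data only
    and reused for the test data, so no test information leaks into training.
    """
    fill_values = {
        "Mean": train[attribute].mean(),
        "Median": train[attribute].median(),
        "Frequent": train[attribute].mode().iloc[0],
    }
    results = {}
    for name, value in fill_values.items():
        tr, te = train.copy(), test.copy()
        tr[attribute] = tr[attribute].fillna(value)
        te[attribute] = te[attribute].fillna(value)
        results[name] = (tr, te)
    return results


train = pd.DataFrame({"budget": [10.0, 10.0, 40.0, np.nan]})
test = pd.DataFrame({"budget": [np.nan, 25.0]})
for name, (tr, te) in simple_imputations(train, test, "budget").items():
    print(name, tr["budget"].iloc[-1], te["budget"].iloc[0])
```

Working on copies keeps the original dataframes untouched, so every strategy starts from the same missing values.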

### 2.2 Evaluating the Scores

**Method**: `eval_and_show`

**Input**:
- **df_array** → A list of datasets containing the train and test data for each imputation method.  
- **target** → The target attribute of the dataset.  

**Output**:
- Displays the evaluation metrics of the model after applying the imputations.  

**Description**:
The `eval_and_show` function evaluates the impact of different imputation methods by measuring the model's performance on the imputed datasets. It reports R2, MSE, RMSE, MAPE, and MAE for each method, so their effect on predictive performance can be compared side by side.


In [31]:
auto.eval_and_show(df_arr, target)

Error Evaluation Metrics for Different Methods:
                   R2 Score     MSE Score    RMSE Score     MAPE Score     MAE Score
Random                0.292  2.911342e+16  1.706265e+08  31567153.742%  1.110190e+08
Mean                  0.637  1.491611e+16  1.221315e+08   5383747.338%  6.518379e+07
Median                0.633  1.506832e+16  1.227531e+08   2523536.329%  6.491659e+07
Frequent              0.637  1.493441e+16  1.222064e+08   4048972.291%  6.482770e+07
KNN                   0.637  1.491611e+16  1.221315e+08   5383747.338%  6.518379e+07
Linear Regression     0.615  1.581651e+16  1.257637e+08   3032091.729%  6.718780e+07
Drop                  0.679  1.459934e+16  1.208277e+08    4933155.66%  5.981908e+07
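
As a reference for reading the table, the five metrics can be written out in a few lines of plain Python using their standard textbook definitions. The function name `regression_metrics` is an assumption for this sketch; how `eval_and_show` computes its scores internally is not shown in this document.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, RMSE, MAPE (in percent) and MAE for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    # MAPE divides by the true value, so true values near zero inflate it
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"R2": r2, "MSE": mse, "RMSE": math.sqrt(mse), "MAPE": mape, "MAE": mae}


y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
print(regression_metrics(y_true, y_pred))

# One near-zero true value dominates MAPE while MAE stays the same
print(regression_metrics([1.0, 200.0, 300.0, 400.0], [11.0, 190.0, 310.0, 390.0]))
```

The very large MAPE values in the table would be consistent with this effect if some movies have a `revenue` close to zero, although the source does not confirm the cause. In that situation R2, RMSE, and MAE are usually the easier metrics to compare.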


# 3. Deciding on an Imputation Method

### 3.1 Asking for General Tips Regarding the Choice

**Method**: `tips`

**Description**:
The `tips` function provides general guidance on selecting an imputation method by listing the advantages and disadvantages of each approach. This helps users make an informed decision based on their dataset and analysis needs.


In [2]:
auto.tips()


    Here are some advantages and disadvantages of the imputation methods:

    1. **Random Imputation**
        - **Advantages:**
            - Simple and fast.
            - Useful for large datasets with small missing data.
            - Serves as a baseline or fallback method when other methods don’t significantly improve results.
        - **Disadvantages:**
            - Does not account for relationships between features.
            - Can introduce noise and distort patterns.
            - No statistical basis, potentially leading to unreliable results.

    2. **Mean Imputation**
        - **Advantages:**
            - Simple and fast.
            - Works well with normally distributed data.
            - No loss of data points.
        - **Disadvantages:**
            - Can distort data distributions, especially in skewed data.
            - Reduces variability, which might affect model performance.
            - Not suitable for skewed or non-normal distributions.

    3. **

### 3.2 Choosing a Method for Imputation

#### 3.2.1 Getting the Data of the Chosen Method

**Method**: `get_imputed_data`

**Input**:
- **df_array** → A list containing train and test datasets after performing each imputation method.  
- **method** → An integer representing the chosen imputation method:  
  - `1` → Random  
  - `2` → Mean  
  - `3` → Median  
  - `4` → Frequent  
  - `5` → KNN  
  - `6` → Linear Regression (LR)  
  - `7` → Drop  

**Output**:
- Returns the train and test datasets corresponding to the selected imputation method.


In [33]:
# Example of imputing using the mean
mean = 2
imputed_data = auto.get_imputed_data(df_arr, mean)
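
Bare integers such as `2` are easy to mix up. One option is to give the method codes listed above readable names with an `IntEnum`. The class `ImputationMethod` is a hypothetical convenience for this sketch and is not part of the `automation` module.

```python
from enum import IntEnum


class ImputationMethod(IntEnum):
    """Hypothetical names for the method codes listed above."""
    RANDOM = 1
    MEAN = 2
    MEDIAN = 3
    FREQUENT = 4
    KNN = 5
    LINEAR_REGRESSION = 6
    DROP = 7


# int(...) yields the plain code that get_imputed_data expects, e.g.
# auto.get_imputed_data(df_arr, int(ImputationMethod.MEAN))
print(int(ImputationMethod.MEAN))
print(ImputationMethod(6).name)
```

An `IntEnum` member also compares equal to its integer value, but whether `get_imputed_data` accepts one directly has not been verified here, so converting with `int(...)` is the cautious choice.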

#### 3.2.2 Apply All Imputations

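**Method**: `apply_all_imputations`

Create a list of pairs `(attribute, imputed_data)`, one per imputed attribute, and pass it to `apply_all_imputations`.

**Input**:
- **df_train** → The train dataset.  
- **df_test** → The test dataset.  
- **data_per_attr** → A list of `(attribute, imputed_data)` pairs, where `imputed_data` is the result of `get_imputed_data` for that attribute.  

**Description**:
This method applies the imputation chosen for each attribute to the given datasets.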

In [34]:
data_per_attr = [(attribute, imputed_data)]
auto.apply_all_imputations(df_train, df_test, data_per_attr)

# 4. Save the Imputed Data

**Method**: `save`

**Input**:
- **name** → The base name of the file; judging by the output below, `_imputed.csv` is appended to it.  
- **path** → The path where the file should be saved.  
- **df** → The dataframe to save.  
- **(index)** → Optional; whether to include the index column in the saved file (default is `False`).  

**Description**:
This method saves the dataframe to the specified location and prints a confirmation message upon successful completion.

In [35]:
auto.save('train', './data/imputed/', df_train)
auto.save('test', './data/imputed/', df_test)

Data saved to ./data/imputed/train_imputed.csv
Data saved to ./data/imputed/test_imputed.csv
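
For comparison, a saving helper with the same file naming can be sketched with pandas alone. The function `save_imputed` below is an assumption modelled on the output above (`<name>_imputed.csv` inside `path`); it is not the implementation of `auto.save`.

```python
import os
import tempfile

import pandas as pd


def save_imputed(name, path, df, index=False):
    """Write df to <path>/<name>_imputed.csv and return the file path.

    Hypothetical helper mirroring the file names in the output above.
    """
    os.makedirs(path, exist_ok=True)  # create the target folder if needed
    file_path = os.path.join(path, f"{name}_imputed.csv")
    df.to_csv(file_path, index=index)
    print(f"Data saved to {file_path}")
    return file_path


toy = pd.DataFrame({"budget": [1.0, 2.0], "revenue": [3.0, 4.0]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_imputed("train", tmp, toy)
    reloaded = pd.read_csv(saved)
    print(reloaded.equals(toy))
```

Writing with `index=False` avoids an extra unnamed column appearing when the file is read back with `pd.read_csv`.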
