# Smartphone Data - Code Review

* Interested party: Procurement Team
* Question: What is the best new mobile phone to offer to the university's employees?
* Data flow: smartphone data is read from a CSV file and prepared for visualization
* Code to be reviewed: a function that plots a variable passed to it against `"price"`
* Problem 1: this function contains copied-and-pasted code that violates the DRY principle (a modularity issue)
    * Solution: refactor the code using the `column_to_label()` function defined below
* Problem 2: the unit test `test_nan_values` is meant to verify that `NaN` values were removed from the cleaned DataFrame of smartphone data ingested from a CSV file. However, the test does not pass.
    * Solution: 
        * Rework this unit test so that it matches the transformation logic in the `prepare_smartphone_data()` function.
        * Ensure that the unit tests finish with `ExitCode.OK`, which indicates that every test collected by `pytest` passed.

* General procedures:
    1. Review code for documentation and readability

    2. Review code for PEP-8 compliance

    3. Review code for DRY principles

    4. Review unit tests
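
For Problem 1, the repeated label-formatting code can be pulled into one helper. The sketch below is only an illustration: the notebook's own `column_to_label()` is defined later and may behave differently, so the body shown here (underscores replaced by spaces, then title-cased) and the helper `axis_labels()` are assumptions.

```python
def column_to_label(column_name):
    """Turn a snake_case column name into a readable label (assumed behavior)."""
    if not isinstance(column_name, str):
        raise TypeError("column_name must be a string")
    return " ".join(column_name.split("_")).title()


def axis_labels(variable, target="price"):
    """Hypothetical helper: build the x-label, y-label, and title in one place."""
    x_label = column_to_label(variable)
    y_label = column_to_label(target)
    return x_label, y_label, f"{x_label} vs. {y_label}"


print(axis_labels("battery_capacity"))
```

With the formatting in a single function, a plotting function only needs to call the helper once per label instead of repeating the string manipulation for each axis and the title.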

## Ingesting from CSV and Preparing Data

In [1]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

file_path = "workspace/sources/datacamp/general_datasets/smartphones.csv"

In [2]:
# Function to prepare the data for visualization

def prepare_smartphone_data(file_path):
    """
    To prepare the smartphone data for visualization, a number of transformations 
    will be applied after reading the raw DataFrame into memory, including:
        - reducing the number of columns to only those needed for later analysis
        - removing records without a battery_capacity or os value
        - dividing the price column by 100 to find the dollar amount
    
    :param file_path: the file path where the raw smartphone data is stored
    :return: a cleaned dataset having had the operations above applied to it
    """
    
    if os.path.exists(file_path):
        raw_data = pd.read_csv(file_path)
        print(raw_data.head())  # TODO: Use this for checking out the dataset, remove before submission
    else:
        raise FileNotFoundError(f"File containing smartphone data not found at path {file_path}")

    reduced_columns = [
        "brand_name",
        "os",
        "price",
        "avg_rating",
        "processor_speed",
        "battery_capacity",
        "screen_size"
    ]
    trimmed_data = raw_data.loc[:, reduced_columns]

    # Remove records without a battery_capacity or os value
    reduced_data = trimmed_data.dropna(subset=["battery_capacity", "os"])

    # Divide the price column by 100 to find the dollar amount
    reduced_data["price"] = reduced_data["price"] / 100
    
    return reduced_data


# Call the function
cleaned_data = prepare_smartphone_data(file_path)

  brand_name                    model   price  avg_rating  5G_or_not  \
0      apple          Apple iPhone 11   38999         7.3          0   
1      apple  Apple iPhone 11 (128GB)   46999         7.5          0   
2      apple  Apple iPhone 11 Pro Max  109900         7.7          0   
3      apple          Apple iPhone 12   51999         7.4          1   
4      apple  Apple iPhone 12 (128GB)   55999         7.5          1   

  processor_brand  num_cores  processor_speed  battery_capacity  \
0          bionic        6.0             2.65            3110.0   
1          bionic        6.0             2.65            3110.0   
2          bionic        6.0             2.65            3500.0   
3          bionic        6.0             3.10               NaN   
4          bionic        6.0             3.10               NaN   

   fast_charging_available  ...  internal_memory  screen_size  refresh_rate  \
0                        0  ...               64          6.1            60   
1     

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reduced_data["price"]=reduced_data["price"]/ 100
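
The warning above appears because a column is assigned on a DataFrame that was derived from another one, and pandas cannot tell whether the assignment is meant to reach the original data. One common way to avoid it is to take an explicit copy before assigning. The sketch below uses a small in-memory DataFrame with made-up values rather than the real CSV; `sample`, `columns`, and `reduced` are names assumed for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real CSV
sample = pd.DataFrame(
    {
        "brand_name": ["apple", "apple", "samsung"],
        "os": ["ios", "ios", None],
        "price": [38999, 51999, 20000],
        "battery_capacity": [3110.0, None, 5000.0],
    }
)

columns = ["brand_name", "os", "price", "battery_capacity"]

# .copy() makes the result an independent DataFrame,
# so the column assignment below is unambiguous
reduced = sample.loc[:, columns].dropna(subset=["battery_capacity", "os"]).copy()
reduced["price"] = reduced["price"] / 100

print(reduced)
```

The original `sample` keeps its raw prices, and only the copy holds the converted values.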


## Creating Tests - NaN/Null Check

In [3]:
# Import required packages
import pytest
import ipytest

ipytest.config.rewrite_asserts = True
__file__ = "notebook.ipynb"


# Create a clean DataFrame fixture
@pytest.fixture()
def clean_smartphone_data():
    return prepare_smartphone_data(file_path)
    
def test_nan_values(clean_smartphone_data):
    """
    Test that there are no NaN values in "battery_capacity" or "os"
    """

    # Assert there are no NaN values in "battery_capacity" or "os"
    assert clean_smartphone_data["battery_capacity"].isnull().sum() == 0
    assert clean_smartphone_data["os"].isnull().sum() == 0

    
ipytest.run("-qq")


[32m.[0m[32m                                                                                            [100%][0m
t_60b29897fd5d459dab839ab1a44a8692.py::test_nan_values
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    reduced_data["price"]=reduced_data["price"]/ 100



<ExitCode.OK: 0>
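
Because the test has to stay in step with the `subset` passed to `dropna` in `prepare_smartphone_data()`, one option is to drive every check from a single list of columns. This is a minimal sketch, assuming a hypothetical helper `missing_counts()` and made-up in-memory data in place of the fixture:

```python
import pandas as pd

# Columns that the cleaning step is expected to leave free of NaN values
REQUIRED_COLUMNS = ["battery_capacity", "os"]


def missing_counts(data, columns):
    """Return the number of missing values in each of the given columns."""
    return {column: int(data[column].isnull().sum()) for column in columns}


# Made-up rows; the second and third each lack one required value
raw = pd.DataFrame(
    {
        "os": ["android", None, "ios"],
        "battery_capacity": [5000.0, 4500.0, None],
    }
)
cleaned = raw.dropna(subset=REQUIRED_COLUMNS)

# Every required column should report zero missing values after cleaning
assert all(count == 0 for count in missing_counts(cleaned, REQUIRED_COLUMNS).values())
print(missing_counts(raw, REQUIRED_COLUMNS))
```

Inside the notebook, the same helper could be called on the `clean_smartphone_data` fixture, so adding a column to the `dropna` subset only requires updating the list.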