# Data Transformation I

Next, we will begin transforming our dataset by dropping values. The primary goals of this step are to:

* drop rows with missing data
* drop select columns with overwhelmingly missing data

Use the documentation linked in each code block. When you are done with this section of the project, validate that your output matches the screenshot provided in the `docs/part2.md` file.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# TODO: load `data/raw/shopping.csv` as a pandas dataframe
# Relative path; assumes the notebook is run from the project root (shopping-behavior/)
df = pd.read_csv('data/raw/shopping.csv')

# Make a backup copy of the original dataframe
df_copy = df.copy()

In [4]:
# TODO: print out the shape of this dataframe for better clarity
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html

print(df.shape)

(3900, 15)


In [5]:
# TODO: display how many null values are in each column of this dataframe
# Documentation: https://datatofish.com/count-nan-pandas-dataframe/


df.isna().sum()



Customer ID                  0
Age                        390
Gender                       0
Item Purchased               0
Purchase Amount (USD)        0
Location                   390
Size                         0
Color                        0
Season                       0
Review Rating             2469
Shipping Type                0
Promo Code Used              0
Previous Purchases           0
Payment Method               0
Frequency of Purchases    2340
dtype: int64
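
To judge which columns are "overwhelmingly" missing, the raw counts above can be turned into fractions. The following is a minimal sketch on a small made-up dataframe; the `toy` frame and the 0.5 threshold are illustrative assumptions, not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the shopping dataset
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 35.0, 40.0],
    "Frequency of Purchases": [np.nan, np.nan, "Weekly", np.nan],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# isna() marks missing cells as True; mean() of booleans is the missing fraction
missing_fraction = toy.isna().mean()

# Columns whose missing fraction exceeds the assumed 50% threshold
mostly_empty = missing_fraction[missing_fraction > 0.5].index.tolist()

print(missing_fraction)
print(mostly_empty)
```

Applied to the counts printed above, "Frequency of Purchases" works out to 2340 / 3900 = 0.6 and "Review Rating" to 2469 / 3900, or about 0.63.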

In [7]:
# TODO: roughly 60% of the values in "Frequency of Purchases" are missing (2340 of 3900). Drop this column, as it is mostly empty and unneeded for our analysis.
# In addition, also drop "Customer ID" as this column is also unnecessary
# Reassign this dropped dataframe as a new variable
# Documentation: drive.google.com/drive/folders/1pAWY1JqIQw26uhtT272AoDDeq7jtbkm2

# Drop the "Frequency of Purchases" and "Customer ID" columns and keep the result in a new dataframe
dropped_df = df.drop(columns=['Frequency of Purchases', 'Customer ID'])


In [8]:
# TODO: print out the shape of this dataframe and verify that the shape is "(3900, 13)"

print(dropped_df.shape)

# Shape verified as (3900, 13)

(3900, 13)


In [9]:
# TODO: while "Review Rating" is also mostly empty, we are interested in figuring out why some users
# leave reviews and others don't. 

# Therefore we will NOT drop this column. Instead, let's replace
# all missing values in "Review Rating" with "Missing", and all non-na values with "Present"
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

# Label each row by whether a rating exists:
# non-missing values become "Present", missing (NaN) values become "Missing"
has_rating = dropped_df['Review Rating'].notna()
dropped_df['Review Rating'] = np.where(has_rating, "Present", "Missing")
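
The "Missing" / "Present" recoding described in the TODO can be sanity-checked on a tiny example before running it on the full column. This is a sketch only; the `ratings` Series below is made-up data, and `np.where` is one of several ways to do the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings column with two missing entries
ratings = pd.Series([4.5, np.nan, 3.0, np.nan], name="Review Rating")

# notna() is True where a rating exists; np.where picks a label per row
labels = pd.Series(
    np.where(ratings.notna(), "Present", "Missing"),
    index=ratings.index,
    name=ratings.name,
)

# After recoding, only the two labels should remain and no NaN values
counts = labels.value_counts()
print(counts)
```

Checking `value_counts()` afterwards confirms that no numeric ratings are left in the column.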


In [10]:
# TODO: Now that we've dropped and transformed our columns, drop the remaining rows that contain missing values
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
# The remaining null values are in "Age" and "Location"

# dropna() returns a new dataframe, so assign the result back
dropped_df = dropped_df.dropna()

In [11]:
# TODO: display how many null values are in each column of this dataframe
# validate that each column has no missing values

print(dropped_df.isna().sum())


Age                      0
Gender                   0
Item Purchased           0
Purchase Amount (USD)    0
Location                 0
Size                     0
Color                    0
Season                   0
Review Rating            0
Shipping Type            0
Promo Code Used          0
Previous Purchases       0
Payment Method           0
dtype: int64
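
One detail worth keeping in mind for this step: `DataFrame.dropna()` returns a new dataframe and leaves the original untouched unless the result is assigned. A minimal sketch with made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 35.0],
    "Location": ["Maine", "Nevada", np.nan],
})

toy.dropna()                 # result is discarded; toy itself is unchanged
rows_without_assignment = len(toy)

cleaned = toy.dropna()       # keep the returned dataframe
rows_with_assignment = len(cleaned)

print(rows_without_assignment, rows_with_assignment)
```

Only the first row has no missing values, so `cleaned` keeps one row while `toy` still has all three.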


In [12]:
# TODO: print out the shape of this dataframe and verify that the shape is "(3158, 13)"

print(dropped_df.shape)

In [13]:
# TODO: print out the first 5 rows of this dataframe for validation

# head() returns the first 5 rows by default
print(dropped_df.head())

In [14]:
# TODO: write this newly transformed dataset to the `data/processed` folder. Name it "shopping_cleaned.csv" 
# Be sure to not include an additional index when writing this csv file
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

# Write the cleaned dataframe to the processed folder without the index column
# Relative path; assumes the notebook is run from the project root and data/processed exists
dropped_df.to_csv('data/processed/shopping_cleaned.csv', index=False)
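
To see what `index=False` changes, the CSV text can be written to an in-memory buffer and read back. This is a small sketch with made-up data; the `toy` frame and the use of `io.StringIO` in place of a real file are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
toy = pd.DataFrame({"Gender": ["Male", "Female"], "Size": ["M", "L"]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same columns, with no extra index column
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))
```

Without `index=False`, the row labels would be written as an extra unnamed first column, which shows up as `Unnamed: 0` when the file is read back.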