# Data Cleaning: Dropping N/A and Duplicate Values

Data is not always perfect: a dataset may contain observations with missing values for some variables, or it may contain duplicate observations. In some cases, we want to be able to clean up a dataset by dropping these observations, using methods from the `pandas` library.

This is especially helpful when plotting visualizations with the `Python 3` kernel, because you will be plotting fewer points and the visualization will load more quickly. For example, for our affluence bubble map visualization, we cut our large dataset down from over 200 million observations to around 2 million by removing any observations with `NaN` values and any duplicate observations. This reduction was necessary because the code for this visualization needed the `geoviews` library, which the `rapids` kernel does not support.

## Import Libraries

Import the libraries needed for the data cleaning code.

In [1]:
import os
import pandas as pd

## Read In Data

Read in the data that you would like to clean. The `os.getcwd()` function returns the current working directory, which should be the `data_wrangling` folder. To access the data file, we replace `data_wrangling` in that path with `synthetic_data`, the folder that contains the file. Once that has been done, we can go ahead with reading in the data and performing the necessary data manipulations.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('data_wrangling', 'synthetic_data')
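
Note that `str.replace` swaps every occurrence of `data_wrangling` in the path, not just the final folder name. As a minimal sketch of an alternative, assuming `synthetic_data` sits next to `data_wrangling` in the same parent folder, `pathlib` can replace only the last path component. The helper name `sibling_data_dir` and the example path are hypothetical and used only for illustration.

```python
from pathlib import Path


def sibling_data_dir(cwd, current="data_wrangling", target="synthetic_data"):
    """Return the folder named `target` that sits next to `cwd`.

    Hypothetical helper: assumes `cwd` is the `current` folder and that
    `target` shares its parent directory.
    """
    cwd = Path(cwd)
    if cwd.name != current:
        raise ValueError(f"expected to be inside '{current}', got '{cwd.name}'")
    # with_name() changes only the final path component.
    return cwd.with_name(target)


# Illustrative path only; in the notebook you would pass Path.cwd().
example_dir = sibling_data_dir("/home/user/project/data_wrangling")
print(example_dir)
```

In the notebook itself, `sibling_data_dir(Path.cwd())` would play the role of `DATA_DIR`.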

In [3]:
tester = pd.read_parquet(os.path.join(DATA_DIR, 'final_combinedsubclus.parquet'))

## Drop N/A Values

We can drop rows that contain `NaN` values using the `.dropna()` method. With `axis=0` and `how='any'`, a row is dropped if any of its values is missing. For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html.

In [5]:
df1 = tester.dropna(how='any', axis=0)
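
To make the effect of `how` concrete, here is a small sketch on a made-up DataFrame. The column names and values are invented for illustration and are not taken from the real dataset. It compares `how='any'`, `how='all'`, and the `subset` parameter, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "region": ["north", "south", None, None],
})

# Drop a row if ANY of its values is missing (what the notebook does).
any_missing = toy.dropna(how="any", axis=0)

# Drop a row only if ALL of its values are missing.
all_missing = toy.dropna(how="all", axis=0)

# Drop a row only if the listed columns contain a missing value.
income_only = toy.dropna(subset=["income"])

print(len(toy), len(any_missing), len(all_missing), len(income_only))
```

If only a few columns matter for a visualization, `subset` can preserve many more rows than `how='any'` applied to every column.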

## Drop Duplicate Values

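We can drop duplicate rows using the `.drop_duplicates()` method. Note that `keep=False`, as used here, removes every copy of a duplicated row, including the first occurrence; the default, `keep='first'`, retains one copy of each. For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html.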

In [6]:
df2 = df1.drop_duplicates(keep=False)
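
The choice of `keep` changes which rows survive, so it is worth checking on a small example before running it on the full dataset. The sketch below uses made-up data (not the real dataset) to contrast `keep=False` with the default `keep='first'`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "value": ["a", "a", "b", "c", "c", "c"],
})

# keep=False: rows that have any duplicate are removed entirely.
no_copies = toy.drop_duplicates(keep=False)

# keep='first' (the default): one copy of each distinct row is retained.
first_copy = toy.drop_duplicates(keep="first")

print(len(toy), len(no_copies), len(first_copy))
```

With `keep=False`, only rows that were unique to begin with remain. If the goal is one row per distinct observation, `keep='first'` may be the better fit.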

## Export as Parquet File

Once you've dropped the `NaN` and duplicate rows and reduced your dataset to a smaller number of observations, you can save the dataframe as a `parquet` file so that you don't need to rerun this code each time you want to use the cleaned data.

In [8]:
df2.to_parquet(os.path.join(DATA_DIR, 'cleaned_final_combinedsubclus.parquet'))