<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Basic_Data_Cleaning_(answer_key).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Basic Data Cleaning**

Data cleaning is the “process of fixing or removing data that is inaccurate, duplicated, or outside the scope of your research question” (https://datascience.cancer.gov/training/learn-data-science/clean-data-basics).
It is time-consuming: data scientists usually spend a significant share of a project getting datasets into a form they can use in subsequent steps.

It is important to be able to deal with messy data, which includes missing values, inconsistent formatting, malformed records, and nonsensical outliers.
We will use the pandas and NumPy libraries to perform some data cleaning steps.


In this tutorial, you will learn:

* How to identify and remove columns that contain only a single value.
* How to remove columns with duplicated information.
* How to identify and remove rows that contain duplicate observations.

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

#Messy Dataset


The breast cancer dataset classifies each breast cancer patient as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables.

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))





In [None]:
# Exercise: review the content and description of the breast cancer dataset by clicking on the links above.
# The !wget command downloads files from the internet. Format: !wget "URL" -O filename.csv
# The -O option (capital letter O, not the digit zero) sets the name of the file the download is saved as.
# Download the breast-cancer.csv file and save it as bcancer_data.csv, then print the first rows of the file.

#Your code here
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" -O bcancer_data.csv
!head bcancer_data.csv


In [None]:
# Now read the CSV using pandas (remember the last module). The file does not contain a header row, so pass header=None when you read it.

#Your code here
#from pandas import read_csv
import pandas as pd
bcancer = pd.read_csv("bcancer_data.csv", header=None)
bcancer.head()


In [None]:
# Let's add the column labels (check the breast-cancer.names link). Hint: assign a list to the .columns attribute

#your code here
bcancer.columns = ['age', 'menopause', 'tumor-size', 'inv-nodes', 'node-cap', 'deg-malig', 'breast', 'breast-quad', 'irradiat', 'class']
bcancer.head()
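
As a possible alternative to assigning `.columns` afterwards, the labels can be supplied while reading the file through the `names` parameter of `read_csv`. The sketch below is only an illustration: the CSV text and the three labels are made-up stand-ins, not the workshop file.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row, three-column extract standing in for a headerless CSV file
csv_text = "40-49,premeno,left\n50-59,ge40,right\n"

# header=None says the first line is data; names= supplies the column labels
toy = pd.read_csv(
    StringIO(csv_text),
    header=None,
    names=["age", "menopause", "breast"],
)

print(toy.columns.tolist())
print(toy.shape)
```

Either route should give the same labeled DataFrame; setting the labels at read time simply avoids a second step.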


In [None]:
# Check the summary statistics for the imported data. Hint: .describe() method

#your code here

bcancer.describe()



Answer the following questions:

1.   How many total observations are there in the breast-cancer.csv dataset? 286
2.   What does the unique number represent in each column? It is the number of distinct values/labels in that column. E.g., for the age column unique is 6 because the data contain 6 age ranges.
3.   How many observations are there for no-recurrence-events? 201
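
The summaries these questions rely on can be sketched on a small made-up table. The values below are illustrative assumptions, not rows taken from the dataset. For text columns, `describe()` reports `count`, `unique`, `top`, and `freq`, while `value_counts()` gives the number of rows per label.

```python
import pandas as pd

# Hypothetical toy table with two text columns
toy = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49", "60-69", "50-59"],
    "class": [
        "no-recurrence-events",
        "recurrence-events",
        "no-recurrence-events",
        "no-recurrence-events",
        "recurrence-events",
    ],
})

# For text (object) columns, describe() returns count, unique, top and freq
summary = toy.describe()
n_rows = summary.loc["count", "age"]
n_age_labels = summary.loc["unique", "age"]

# value_counts() tallies how many rows carry each label
class_counts = toy["class"].value_counts()

print(summary)
print(class_counts)
```

Note that when a DataFrame mixes numeric and text columns, `describe()` summarizes only the numeric ones by default; passing `include="all"` covers both kinds.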






###Download messy data file

Now we are going to work with a modified version of the Breast Cancer Dataset. It contains a couple of additional rows and columns that help to illustrate the cases below.

In [None]:
!wget "https://raw.githubusercontent.com/carighi/al_ml_workshop/main/data/messy2.csv" -O messy2.csv
!head messy2.csv

#Identify Columns That Contain a Single Value


In [None]:
# Summarize the number of unique values in each column using pandas' .nunique()
# load the dataset
import pandas as pd
df = pd.read_csv('messy2.csv', header=None)
print("Shape of messy data: ", df.shape)
print("Column\t#Unique values ")
print(df.nunique())

We can see that column index 5 has only a single value. A column that never varies carries no information for prediction, so it should be removed.
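
One detail worth keeping in mind: `nunique()` ignores missing values by default, so a column holding one value plus some gaps also reports a count of 1. The sketch below uses a made-up toy DataFrame (not the workshop file) to show the difference between the default and `dropna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": ["x", "y", "x", "z"],
    "b": [1, 1, 1, 1],              # truly constant column
    "c": [2.0, np.nan, 2.0, 2.0],   # one value plus a missing entry
})

counts = toy.nunique()                       # NaN is ignored (default)
counts_with_nan = toy.nunique(dropna=False)  # NaN counts as its own value

# Columns with one distinct non-missing value
single_value_cols = counts[counts == 1].index.tolist()
# Columns that are constant with no missing entries at all
strictly_constant_cols = counts_with_nan[counts_with_nan == 1].index.tolist()

print(single_value_cols)
print(strictly_constant_cols)
```

Whether a column like `c` should be dropped or have its gaps filled is a judgment call that depends on the data.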

In [None]:
# Alternatively, loop over the columns
for i in df.columns:
    unique_val = df[i].nunique()
    print(f'Number of unique values in column {i}:', unique_val)

#Delete columns that contain a single value

In [None]:
# Delete column 5, which contains a single unique value. Hint: .drop() method
# your code here
df2 = df.drop(columns=[5])
df2.head()


In [None]:
# Alternatively, when you don't know in advance which columns should be dropped
# get the number of unique values for each column using nunique
counts = df.nunique()
# record columns to delete: this list comprehension collects the label of every
# column whose number of unique values equals 1
to_del = [col for col, n in counts.items() if n == 1]
print(to_del)
# drop the single-value columns into a new DataFrame so that df stays unchanged
df_alt = df.drop(columns=to_del)
print(df.shape, df_alt.shape)
df_alt.head()



##Identify columns that contain duplicated information.

If the same information appears in more than one column (note that the column labels could be different), we can remove one of them.

Let's add the labels to the columns.

In [None]:
# Add column labels to df2 and view the first rows. Hint: use .columns. Here are the labels: 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-cap', 'deg-malig', 'breast', 'breast-quad', 'irradiat', 'class', 'breast_and_quadrant'
df2.columns = ['age', 'menopause', 'tumor-size', 'inv-nodes', 'node-cap', 'deg-malig', 'breast', 'breast-quad', 'irradiat', 'class', 'breast_and_quadrant']
df2.head()

In the messy2 data, the last column is a combination of the breast and breast quadrant. This information is in other columns, so we can remove it.
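
One way to check such a claim rather than rely on inspection is to test whether every combination of the source columns maps to exactly one value of the candidate column. The sketch below is an assumption-laden illustration: the toy rows, the `-` separator in the combined column, and the helper name `is_determined_by` are all hypothetical and not taken from messy2.csv.

```python
import pandas as pd

# Hypothetical toy rows; the combined-column format is an assumption
toy = pd.DataFrame({
    "breast": ["left", "right", "left", "right"],
    "breast-quad": ["left_low", "left_up", "left_low", "central"],
    "breast_and_quadrant": [
        "left-left_low", "right-left_up", "left-left_low", "right-central",
    ],
    "deg-malig": ["3", "2", "1", "2"],
})


def is_determined_by(frame, target, sources):
    """Return True if each combination of `sources` maps to a single `target` value."""
    values_per_group = frame.groupby(sources)[target].nunique()
    return bool((values_per_group <= 1).all())


# The combined column adds nothing beyond its two source columns
combined_is_redundant = is_determined_by(
    toy, "breast_and_quadrant", ["breast", "breast-quad"]
)
# deg-malig varies within a (breast, breast-quad) group, so it is not redundant
grade_is_redundant = is_determined_by(toy, "deg-malig", ["breast", "breast-quad"])

print(combined_is_redundant, grade_is_redundant)
```

A check like this only shows that the column is redundant in the rows at hand; it does not prove the relationship holds for new data.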

In [None]:
# Remove the breast_and_quadrant column from df2 and display the first few rows
df2.drop(columns='breast_and_quadrant', inplace=True)
df2.head()

#Identify rows that contain duplicate data

In [None]:
# Identify and display all duplicate rows.
# The duplicated() method returns a Boolean Series that flags duplicate rows.
# The keep=False argument marks every occurrence of a duplicate as True, so df2[df2.duplicated(keep=False)] returns all the duplicate rows in the DataFrame.
# To find duplicates based on particular columns only, use df2.duplicated(subset=['col1', 'col2'], keep=False)
duplicate_rows = df2[df2.duplicated(keep=False)]
print(duplicate_rows)
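
The effect of the `keep` and `subset` arguments described in the comments above can be sketched on a small made-up table. The rows below are illustrative assumptions, not data from messy2.csv.

```python
import pandas as pd

# Hypothetical toy rows: rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49", "40-49"],
    "breast": ["left", "right", "left", "right"],
})

# Default keep='first': only the later copies are flagged
later_copies = toy.duplicated()
# keep=False: every row that has a twin is flagged
all_copies = toy.duplicated(keep=False)
# subset: compare rows on the 'age' column only
same_age = toy.duplicated(subset=["age"], keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
print(same_age.tolist())
```

Using `keep=False` is handy for inspection because it shows each group of identical rows in full, while the default is what a later `drop_duplicates()` call acts on.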




#Delete rows that contain duplicate data

In [None]:
# Delete duplicate rows from the dataset (the first occurrence of each is kept)
df2.drop_duplicates(inplace=True)
print(df2.shape)
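
After dropping duplicates, the surviving rows keep their original index labels, which leaves gaps. A minimal sketch of a non-in-place variant that also renumbers the index and reports how many rows were removed is shown below; the toy rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows: rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49", "40-49"],
    "breast": ["left", "right", "left", "right"],
})

# drop_duplicates() returns a new DataFrame and keeps the first copy of each row;
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index
deduped = toy.drop_duplicates().reset_index(drop=True)
n_removed = len(toy) - len(deduped)

print(deduped.shape)
print(n_removed)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original available for comparison, which can be useful when checking that only the intended rows disappeared.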