# Cleaning Data

**_Author: Jessica Cervi_**

**Expected time = 1 hr**

**Total points = 45 points**

    
## Assignment Overview


In this assignment you will use the pandas concepts you have learned to examine and clean data in a DataFrame. You will apply the concept of Tidy Data and explore different ways of cleaning data, such as filling in missing values or removing duplicates. 


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Examine and clean data to prepare it for analysis.  
- Describe the key components of Tidy Data. 
- Practice cleaning data across data types. 
 

## Index:


#### Cleaning Data

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)

## Cleaning Data

In this assignment you will examine and clean data using pandas. You will also use the Tidy Data framework as you solve new data cleaning problems. We will use a dataset that contains sample grades for the students in a course. It includes columns for the enrollment numbers in each course and grades for each assignment. More detailed information about the dataset can be found [here](https://www.kaggle.com/tanmoyie/grading-of-the-students-in-the-exam-or).



### Inspecting your Data

We will begin this assignment by inspecting the data. In this step, we want to look at the data and find ways to make the dataset more usable for our research questions. We will rename the `Class Roll` column to `enrollment`, which may help us remember what this column represents. We will then convert the entries of the `Observation (Marks: 20)` column from integers to floats. 

We will review the attributes `.shape` and `.columns` and the method `.info()`.
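As a quick refresher, here is a minimal sketch of these three tools. It uses a small made-up DataFrame (`toy`, with hypothetical columns `student` and `score`), not the assignment dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the assignment dataset)
toy = pd.DataFrame({"student": ["a", "b", "c"], "score": [10, 12, 9]})

# .shape is an attribute: a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape

# .columns is an attribute: an Index holding the column labels
labels = toy.columns

# .info() is a method: it prints a summary (dtypes, non-null counts) and returns None
toy.info()

print(n_rows, n_cols)
print(list(labels))
```

Note that `.shape` and `.columns` are accessed without parentheses, while `.info()` must be called.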


We will begin by importing the necessary libraries for this assignment and by reading the dataset.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("./data/grading.csv")

Next, we will use `.head(10)` to display the first 10 rows of our DataFrame.

In [None]:
df.head(10)

[Back to top](#Index:) 

### Question 1
*5 points*

Retrieve the number of rows of the DataFrame df and assign the value as an integer to ans1.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans1 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 2
    
*5 points*

What does the attribute `.columns` return? 
- a) A dictionary that has column labels as keys and the mean of each column as values. 
- b) The number of columns in the DataFrame.
- c) The column labels of the DataFrame.
- d) An empty DataFrame with the column labels.

Assign the letter corresponding to your choice as a string to `ans2`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans2 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 3
    
*5 points*

Use the `.rename()` method in `pandas` to rename the column `Class Roll` to `enrollment` and assign the result to a new DataFrame called `df3`.

__Note: questions 3, 4, and 5 are related and need to be solved in sequence.__
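For reference, the sketch below shows how `.rename()` behaves on a made-up DataFrame. The names `toy`, `Old Name` and `new_name` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Old Name": [1, 2], "other": [3, 4]})

# The columns argument maps existing labels to new labels.
# By default .rename() returns a new DataFrame and leaves the original untouched.
renamed = toy.rename(columns={"Old Name": "new_name"})

print(list(toy.columns))
print(list(renamed.columns))
```

Because the original DataFrame is not modified, the result has to be assigned to a variable to be kept.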

In [None]:
### GRADED

### YOUR SOLUTION HERE
df3 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 4
    
*5 points*

Convert the entries from `df3` in the column `Observation (Marks: 20)` from integers to floats. Assign the result to `df4`.

*Hint: use the function `pd.to_numeric()` and set the argument `downcast="float"`*
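The following is a minimal sketch of `pd.to_numeric()` on a made-up integer column (the DataFrame `toy` and the column `marks` are hypothetical). With `downcast="float"`, pandas tries to use the smallest float dtype that can hold the values, which is usually `float32`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"marks": [18, 20, 15]})

# Convert the integer column to a float dtype and store it in a new DataFrame,
# leaving the original DataFrame unchanged
converted = toy.assign(marks=pd.to_numeric(toy["marks"], downcast="float"))

print(toy["marks"].dtype)        # integer dtype
print(converted["marks"].dtype)  # float dtype
```

Checking `.dtype` before and after, as above, is a simple way to confirm that the conversion happened.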

In [None]:
### GRADED

### YOUR SOLUTION HERE
df4 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 5
    
*5 points*


What is the type of `df4`? 
- a) A dictionary 
- b) A pandas DataFrame
- c) A pandas Series
- d) A list

Assign the letter corresponding to your choice as a string to `ans5`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans5 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

## Dealing with Missing or Duplicate Data
You will often have missing or duplicate data in your dataset. It is important that you know how to work through this issue without fundamentally impacting the results of your analysis. The first sentinel value used by pandas is `None`, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, `None` cannot be used in any arbitrary NumPy/pandas array, but only in arrays with data type `object`.
The other missing data representation, `NaN` (short for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. A complete guide about how to handle missing values in pandas can be found [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html).
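The difference between the two sentinels can be seen in a short sketch. The arrays below are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# An array containing None falls back to the generic object dtype
obj_arr = np.array([1, None, 3])

# NaN is a floating-point value, so the array keeps a native float dtype
float_arr = np.array([1.0, np.nan, 3.0])

# pandas converts None to NaN in numeric data and upcasts the integers to float
s = pd.Series([1, None, 3])

# NaN never compares equal to anything, including itself,
# so use .isnull() rather than == to detect it
nan_equals_itself = (np.nan == np.nan)

print(obj_arr.dtype, float_arr.dtype, s.dtype)
print(s.isnull().tolist())
print(nan_equals_itself)
```

In practice this means that missing entries in a numeric pandas column will generally show up as `NaN`, regardless of which sentinel was used to create them.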


Next, we will fill in the missing values of one column with the mean of that column. We will then drop all rows that contain `NaN` values. We will also drop all duplicate rows. 
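As a rough sketch of these operations, the example below applies `.fillna()`, `.dropna()` and `.drop_duplicates()` to a made-up DataFrame. The names `toy`, `quiz` and `exam` are hypothetical and unrelated to the assignment dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, and the last two rows are identical
toy = pd.DataFrame({
    "quiz": [10.0, np.nan, 14.0, 14.0],
    "exam": [50.0, 60.0, 70.0, 70.0],
})

# .mean() skips NaN by default, so the fill value is based on observed entries only
filled = toy["quiz"].fillna(toy["quiz"].mean())

# .dropna() removes every row that contains at least one missing value
no_missing = toy.dropna()

# .drop_duplicates() keeps the first occurrence of each repeated row
no_dupes = toy.drop_duplicates()

print(filled.tolist())
print(len(toy), len(no_missing), len(no_dupes))
```

All three methods return new objects by default, so `toy` itself still contains the missing value and the duplicate row afterwards.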

### Question 6
    
*5 points*



Take the column `CT-2 (Marks: 20)` of the DataFrame `df3` and fill the missing values with the mean value of that column. Assign the resulting Series to the variable `s6`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
s6 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 7
    
*5 points*

Drop all of the `NaN` entries in `df3`. Assign the result to `df7`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
df7 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 8
    
*5 points*

Compute the mean of `enrollment` in the DataFrame `df3`. Assign the result to `ans8`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans8 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 9
    
*5 points*

Drop all of the duplicate rows in the DataFrame `df3`. Assign the result to `df9`.

In [None]:
### GRADED

### YOUR SOLUTION HERE
df9 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
