# **Pandas Basics on CSV reading and Data Manipulation**

Pandas is an open-source Python library for high-performance data manipulation. It is built on top of NumPy, so NumPy must be installed for Pandas to work. Pandas is popular for working with tabular data because it provides powerful data structures, such as DataFrame and Series, that are mainly used for data analysis. It also handles large datasets efficiently. In this exercise, we will see how to use Pandas to read CSV files and do some basic calculations.

In this exercise we have three different CSV files, which contain classification results and validation data for a Landsat scene.

### Data overview:
**1_accuracy_groundtruth.csv** 
Validation data manually collected
* it has one column with three values, 1: snow, 2: snowfree, 12: invalid

**2_accuracy_class1.csv** 
Classification results using classification scheme 1
* it has one column with six values, 11: snow, 10: snowfree, 12: cloud, 15: water, 16: shadow, 0: missing data

**3_accuracy_class2.csv** 
Classification results using classification scheme 2
* it has one column with six values, 11: snow, 10: snowfree, 12: cloud, 15: water, 16: shadow, 0: missing data

### **Note: All cloud, water, shadow, and missing data are considered invalid in the ground truth data.**

## Task 1: Import required libraries
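A minimal sketch of the import step, assuming Pandas is already installed in the environment. The alias `pd` is a common convention rather than a requirement.

```python
# Import Pandas under its conventional alias.
import pandas as pd

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```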

## Task 2: Data read
Read all three CSV files into Pandas DataFrames:

* Ground Truth as **groundtruth_df**
* Classification 1 as **class1_df**
* Classification 2 as **class2_df**

HINT: use read_csv()
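As a rough illustration of `read_csv()`, the sketch below reads a small in-memory stand-in instead of the real file so that it runs anywhere. The header name `value` and the sample rows are assumptions; the actual files may have a different header or none at all.

```python
import io

import pandas as pd

# Hypothetical stand-in for 1_accuracy_groundtruth.csv.
# The column name "value" and the rows are made up for illustration.
sample_csv = io.StringIO("value\n1\n2\n12\n1\n")

# read_csv() accepts a file path or any file-like object.
groundtruth_df = pd.read_csv(sample_csv)

print(groundtruth_df.shape)
```

With the real data, the file name would be passed in place of `sample_csv`, for example `pd.read_csv("1_accuracy_groundtruth.csv")`.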

## Task 3: Data overview
Print the first 10 lines of **groundtruth_df** and the last line of **groundtruth_df**

HINT: use head() and tail()
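A short sketch of `head()` and `tail()` on a made-up DataFrame. The column name `value` and the 15 sample rows are assumptions, not the exercise data.

```python
import pandas as pd

# Hypothetical data: 15 rows in a single column.
groundtruth_df = pd.DataFrame({"value": [1, 2, 12] * 5})

# head(n) returns the first n rows; tail(n) returns the last n rows.
first_ten = groundtruth_df.head(10)
last_one = groundtruth_df.tail(1)

print(first_ten)
print(last_one)
```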

## Task 4: Function for dataframe length
Write a function **check_len()** to test whether the number of rows in the three DataFrames is the same. It takes **df1**, **df2**, and **df3** as input. 
<br>If the row counts are the same, return **True**. If not, return **False**. <br>Call the function with the three DataFrames.
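One possible way to write `check_len()`, shown on small made-up DataFrames. The sample frames `a`, `b`, `c`, and `d` are hypothetical and only serve to exercise the function.

```python
import pandas as pd


def check_len(df1, df2, df3):
    """Return True if all three DataFrames have the same number of rows."""
    # len() on a DataFrame gives its row count; chained == compares all three.
    return len(df1) == len(df2) == len(df3)


# Hypothetical frames for a quick check.
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [4, 5, 6]})
c = pd.DataFrame({"x": [7, 8, 9]})
d = pd.DataFrame({"x": [7, 8]})

same = check_len(a, b, c)
different = check_len(a, b, d)
print(same, different)
```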

## Task 5: New pandas dataframe
Create a Pandas Dataframe which includes three columns: 
* **truth**: from groundtruth_df
* **class1**: from class1_df
* **class2**: from class2_df
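A minimal sketch of combining the three inputs, assuming each one holds a single column (selected by position with `iloc[:, 0]`, since the real header names are not known here). The sample values, including the missing entry, are invented.

```python
import pandas as pd

# Hypothetical single-column inputs with equal row counts.
groundtruth_df = pd.DataFrame({"value": [1, 0, 12, 1]})
class1_df = pd.DataFrame({"value": [11, 10, 15, 11]})
class2_df = pd.DataFrame({"value": [11, 16, 0, None]})  # one missing value

# Build a new DataFrame from the first column of each input.
df = pd.DataFrame({
    "truth": groundtruth_df.iloc[:, 0],
    "class1": class1_df.iloc[:, 0],
    "class2": class2_df.iloc[:, 0],
})

print(df.dtypes)
```

In this sketch the column with a missing value is stored as floating point while the other two are integers, so the columns can end up with different data types.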

## Task 6: Data Post-processing
Now we have a Pandas DataFrame with three columns whose values represent different classes. However, the invalid classes are coded differently in class1 and class2 than in the truth column, and the three columns do not share the same data type. So we need to:

1) Make sure all columns have the same data type (integer)
2) Change **cloud, water, shadow, and missing data** in class1 and class2 to **invalid class (12)**
3) In the column **truth** change values 1 into 11, 0 into 10
<br>

So at the end, we should have:
* **10: Snowfree**
* **11: Snow**
* **12: Invalid (Cloud, Water, Shadow, Missing data)**

HINT: use replace() and don't forget to assign the result back to the variable, or use inplace=True
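The three steps could be sketched as follows on invented data. Two assumptions are worth flagging: the nullable `"Int64"` dtype is used so that rows with missing values survive the cast (a plain `int` cast raises an error on NaN), and the truth column is assumed to code snow as 1 and snowfree as 0, as stated in step 3. If your file uses a different snowfree code, adjust the mapping accordingly.

```python
import numpy as np
import pandas as pd

# Hypothetical combined DataFrame; values are made up for illustration.
df = pd.DataFrame({
    "truth": [1, 0, 12, 1],
    "class1": [11.0, 10.0, 15.0, np.nan],
    "class2": [11.0, 16.0, 0.0, 12.0],
})

# Step 1: same integer type everywhere. "Int64" (capital I) is the nullable
# integer dtype, chosen here as an assumption so NaN does not break the cast.
df = df.astype("Int64")

# Step 2: water (15), shadow (16), and missing data (0) become invalid (12).
# Cloud is already coded as 12.
invalid_map = {15: 12, 16: 12, 0: 12}
df["class1"] = df["class1"].replace(invalid_map)
df["class2"] = df["class2"].replace(invalid_map)

# Step 3: recode the truth column (assumed codes: 1 = snow, 0 = snowfree).
df["truth"] = df["truth"].replace({1: 11, 0: 10})

print(df)
```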

## Task 7: Handling Missing Values
Note that for some rows, values might not be available in class1 or class2. To deal with this missing data, it is best to remove the whole row, even though the value is missing in only one column.

HINT: use dropna()

Then look at the first five lines of df again.
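A small sketch of dropping incomplete rows and inspecting the result, using made-up values. By default `dropna()` removes every row that contains at least one missing value.

```python
import pandas as pd

# Hypothetical DataFrame with missing values in class1 and class2.
df = pd.DataFrame({
    "truth": [11, 10, 12, 11],
    "class1": [11, None, 12, 10],
    "class2": [11, 10, None, 10],
})

# Drop any row with a missing value and assign the result back.
df = df.dropna()

# head() with no argument shows up to the first five rows.
print(df.head())
```

Note that the original row labels are kept after dropping, so the index may have gaps.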

## Task 8: Data Exploration

Find out how many observations remain for each class in the truth column (three classes in total).

HINT: use groupby()
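One way to count observations per class with `groupby()`, shown on invented values. `size()` returns the number of rows in each group.

```python
import pandas as pd

# Hypothetical cleaned DataFrame.
df = pd.DataFrame({
    "truth": [11, 10, 12, 11, 10, 11],
    "class1": [11, 10, 12, 10, 10, 11],
    "class2": [11, 12, 12, 11, 10, 11],
})

# Group rows by the truth class and count the rows in each group.
counts = df.groupby("truth").size()

print(counts)
```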

## Task 9: Data Export

Export df as a csv file.

HINT: use to_csv()
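A minimal export sketch. The output file name `accuracy_combined.csv` and the temporary directory are assumptions chosen so the example does not write into the working folder; `index=False` leaves the row labels out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned DataFrame.
df = pd.DataFrame({
    "truth": [11, 10, 12],
    "class1": [11, 10, 12],
    "class2": [11, 12, 12],
})

# Hypothetical output location inside the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "accuracy_combined.csv")

# Write the DataFrame without the index column.
df.to_csv(out_path, index=False)

# Read the file back to confirm the export round-trips.
check = pd.read_csv(out_path)
print(check)
```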