# Loading data from CSV files using `pandas`

`pandas` is a popular data science package for `Python`. 
It provides data structures and tools for flexible **data manipulation** and **analysis**. 
Throughout this course, we will be using what are known as **data frames**: 2-dimensional tables whose columns can **hold different data types**. 
The data frame is one of the key structures that `pandas` provides. Data frames also exist in the `R` statistical programming language.

Here is how it works:


In [None]:
import pandas as pd

with open('NationalNames1.csv', 'r') as file:
    df_names = pd.read_csv(file)
    del df_names['Id']
    

It's that simple!

Actually, **it can be even simpler**. `pd.read_csv` also accepts a file path directly, so the two lines below produce the same data frame without the `with open()` block.
In both versions, `del df_names['Id']` removes the old `Id` column that came from the source file once the data has been loaded. 

```python
df_names = pd.read_csv('NationalNames1.csv')
del df_names['Id']
```
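
As a small sketch of the same load-then-drop pattern, the block below reads CSV text from memory instead of from disk. The rows and the `df_demo` / `df_no_id` names are made up for illustration and are not taken from `NationalNames1.csv`; `drop(columns=...)` is shown as one possible alternative to `del`.

```python
import io
import pandas as pd

# Hypothetical CSV contents (made-up rows, not the real dataset).
csv_text = """Id,Name,Year,Count
1,Grant,1990,60
2,Mary,1990,120
3,Grant,1991,75
"""

# read_csv accepts a file path or a file-like object;
# StringIO lets us try it without a file on disk.
df_demo = pd.read_csv(io.StringIO(csv_text))

# drop() returns a new data frame without the column and leaves df_demo unchanged,
# whereas `del df_demo['Id']` would modify df_demo in place.
df_no_id = df_demo.drop(columns=['Id'])

print(df_no_id.columns.tolist())
```

Which of the two you use is mostly a matter of taste: `del` changes the existing data frame, while `drop` gives you a new one.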

Let's see what the data frame we created looks like ... 

In [None]:
df_names.shape  # the dimensions as (number of rows, number of columns)

In [None]:
df_names.head()  # this gives the first 5 rows of the data frame

As you can see, the top row is a header row. 
Whereas we had to remove the header row from the list of lists in order to manipulate the data, in `pandas` this row supplies the column names. 
The unnamed column on the left holds the row index. 
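
To make the header row and the row index concrete, here is a minimal sketch on a tiny hand-built data frame. The `df_demo` name and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Made-up data: two columns, three rows.
df_demo = pd.DataFrame({
    'Name': ['Grant', 'Mary', 'Grant'],
    'Count': [60, 120, 75],
})

print(df_demo.shape)             # (3, 2)
print(df_demo.columns.tolist())  # ['Name', 'Count']  <- the header row
print(df_demo.index.tolist())    # [0, 1, 2]          <- the unnamed left-hand column
```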

### Data Filtering

`pandas` allows for easy filtering of the data. 
Imagine that you are only interested in tracking how many people each year named their child "Grant". 
All we have to do is filter the `Name` column by the string "Grant":

In [None]:
df_names[df_names['Name']=='Grant'].head()

That was a very simple line of code, which is one of the benefits of using `pandas`. 

We can break it down step by step. 

1. Let's start with the expression `df_names['Name']=='Grant'`. 
It compares every value in the `Name` column to "Grant" and yields `True` or `False` depending on whether the condition is met. 
Running this expression by itself would just return a boolean for each index.  
1. For this reason, we wrap it in `df_names[...]`, which returns the subset of `df_names` where the boolean is `True`, that is, all rows whose `Name` matches the string "Grant". 

1. The final `.head()` piece is a `pandas` method that you can call on data frames to return the first rows (5 by default). 
You can pass a number to choose how many rows are returned. 
For example, `df_names[df_names['Name']=='Grant'].head(10)` returns the first 10 rows of this subset.
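
The steps above can be sketched one at a time on a small, made-up data frame (the `df_demo`, `mask`, and `grants` names are hypothetical):

```python
import pandas as pd

# Made-up data for illustration.
df_demo = pd.DataFrame({
    'Name': ['Grant', 'Mary', 'Grant', 'John'],
    'Count': [60, 120, 75, 40],
})

# Step 1: the comparison yields one boolean per row.
mask = df_demo['Name'] == 'Grant'
print(mask.tolist())          # [True, False, True, False]

# Step 2: indexing with the booleans keeps only the rows marked True.
grants = df_demo[mask]
print(grants.index.tolist())  # [0, 2]

# Step 3: head(n) limits the result to the first n rows of that subset.
print(grants.head(1))
```

Note that the filtered rows keep their original index labels (0 and 2 here).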

**On your own:** *What do you think `.tail()` in place of `.head()` returns?*

`pandas` supports more advanced filtering as well.

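For example, suppose we want the rows where `Name` is "Grant" **and** `Count` is between 50 and 80 (not including 50 and 80 themselves). 
We can further constrain the "Grant" subset with `Count` conditions, joining the conditions with `&` and wrapping each one in parentheses.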


`pandas` allows us to match on many conditions very easily.

In [None]:
df_names[(df_names['Name'] == 'Grant') & (df_names['Count'] > 50) & (df_names['Count'] < 80)] 
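
As a rough sketch of how the combined condition behaves, the block below builds the pieces separately on made-up data. The `is_grant` and `in_range` names are hypothetical helpers. Each comparison needs its own parentheses because `&` binds more tightly than `==`, `>`, and `<` in Python.

```python
import pandas as pd

# Made-up data; the Count values 50 and 80 sit exactly on the boundaries.
df_demo = pd.DataFrame({
    'Name': ['Grant', 'Grant', 'Grant', 'Mary'],
    'Count': [50, 65, 80, 65],
})

is_grant = df_demo['Name'] == 'Grant'
in_range = (df_demo['Count'] > 50) & (df_demo['Count'] < 80)

# & combines the two boolean columns row by row.
result = df_demo[is_grant & in_range]
print(result['Count'].tolist())  # [65]
```

Because the comparisons are strict (`>` and `<`), the rows with a `Count` of exactly 50 or 80 are left out.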

One last thing to note:  
You can easily **transpose** a data frame by accessing its `T` attribute. 
See the below example:

In [None]:
df_names.T.head()
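
To see what transposing does, here is a minimal sketch on a made-up data frame (`df_demo` and `df_t` are hypothetical names): the column names become the row index, and the old row index becomes the column labels.

```python
import pandas as pd

# Made-up data: 3 rows, 2 columns.
df_demo = pd.DataFrame({
    'Name': ['Grant', 'Mary', 'John'],
    'Count': [60, 120, 40],
})

df_t = df_demo.T

print(df_demo.shape)          # (3, 2)
print(df_t.shape)             # (2, 3)
print(df_t.index.tolist())    # ['Name', 'Count']
print(df_t.columns.tolist())  # [0, 1, 2]
```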

For more info on using a `pandas` data frame, we'll refer you to the appropriate documentation here:  
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html  

----

## <span style="background:yellow">Your Turn</span>

* Write the necessary Python code to load the dataset "sample-salesv2.csv" into a pandas dataframe and explore the dataset. Refer to the provided examples above.  

* Once you have the data frame loaded, use `head()` and the condition matching based on the earlier examples to get familiar with this data set.  

* For an extra challenge, look up the `mean` function in the pandas data frame documentation and use it to find the mean unit price for items that cost less than 50.00. 

The expected mean is 30.341589.

----
`Pandas` data frame documentation:  
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html  


In [None]:
# Add your code below this line 
# Note: normally you would need to import pandas, but if you ran the code above,
# pandas will already be imported.
# -----------------------------

df = pd.read_csv("sample-salesv2.csv")


# SAVE YOUR NOTEBOOK, then `File > Close and Halt`