## Pandas!
Whilst they're cute, we're not talking about *those* Pandas. We're talking about [PyData's Pandas](https://pandas.pydata.org/).

I'm assuming you're familiar with the basics of Python, or with programming syntax in general.
#### Pandas is:
- A fantastic framework built specifically for data analysis, number crunching and manipulation!
- An essential tool in Data Science
- Easy to pick up, hard to master. Your results will ultimately depend on how useful your dataset is, especially as there are many ways of going about the analysis.

I mean, if the [cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) doesn't scare you... Fret not though!

#### The two main components of Pandas
- Series
- DataFrames

A Series can be thought of as a single column in a spreadsheet. It is one-dimensional, but it can still hold any kind of value: integers, strings, and so on.

DataFrames are more complex, and a lot more common. You can think of one as a fully capable spreadsheet, with multiple rows and columns.
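To make the relationship between the two concrete, here is a small sketch (the `pets` data is made up purely for illustration, and the import is explained properly in the first cell further down). It shows that picking a single column out of a DataFrame gives you a Series.

```python
import pandas as pd

# Hypothetical example data, not one of the tutorial's datasets
pets = pd.DataFrame({
    'name': ['Bao', 'Mei', 'Lin'],
    'age': [3, 5, 2],
})

# Selecting a single column hands back a Series
ages = pets['age']

print(type(pets).__name__)   # DataFrame
print(type(ages).__name__)   # Series
print(pets.shape)            # (3, 2): two dimensions, rows and columns
print(ages.shape)            # (3,): one dimension
```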



#### First, we need a dataset!
There's **loads** and I mean [loads](https://github.com/awesomedata/awesome-public-datasets) out there for you to experiment with, but for the moment I've made some simple, lightweight and fairly generic datasets that we can use.

You can import data through various means, for example: 
- Database
- CSV Files
- APIs
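As a rough preview of what those three routes can look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the CSV text, the throwaway in-memory SQLite table and the `api_response` list are all invented, and a real API call would need an HTTP library on top.

```python
import io
import sqlite3

import pandas as pd

# 1) CSV: read_csv accepts a file path or any file-like object
csv_text = "item,quantity\nMugs,2\nKeyrings,10\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# 2) Database: build a throwaway in-memory SQLite database, then query it
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (item TEXT, quantity INTEGER)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Mugs", 2), ("Keyrings", 10)],
)
from_db = pd.read_sql_query("SELECT item, quantity FROM orders", connection)
connection.close()

# 3) API: JSON responses often arrive as a list of dictionaries
api_response = [
    {"item": "Mugs", "quantity": 2},
    {"item": "Keyrings", "quantity": 10},
]
from_api = pd.DataFrame(api_response)

print(from_csv)
print(from_db)
print(from_api)
```

Whichever route the data takes, it ends up as the same kind of table, which is why the rest of this notebook only needs one of them.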

In [1]:
# Let's import pandas! It isn't part of the Python standard library, so you'd normally have to install it via pip (or Anaconda), but I have already installed it for you.
import pandas as pd


# Store the integers into "series"
series = pd.Series([10, 15, 30, 35, 70, 75])

print(series)

0    10
1    15
2    30
3    35
4    70
5    75
dtype: int64


### We're already using Pandas!
We can see Pandas has output our integers as a single column, with their index numbers alongside (remember that Python starts counting at 0).
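Those index numbers are what let us pull individual values back out. As a small sketch (using a separate variable name, `numbers`, so it doesn't overwrite anything in the notebook), here are a few things you could do with the same six integers:

```python
import pandas as pd

numbers = pd.Series([10, 15, 30, 35, 70, 75])

# .iloc selects by position, counting from 0
print(numbers.iloc[0])    # first value: 10
print(numbers.iloc[-1])   # last value: 75

# A Series also comes with built-in number crunching
print(numbers.sum())      # 235
print(numbers.mean())     # roughly 39.17
```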


#### But that's very basic and not very helpful, so let's start getting to the nitty-gritty

#### DataFrames

In [2]:
# Let's visualise the data of a few purchases from an e-commerce site

orders = {
    'T-Shirts': [5],
    'Mugs': [2],
    'Keyrings': [10]
}

# Insert the values from "orders" into a Pandas "DataFrame"

orders = pd.DataFrame(orders)

# And display!
orders

,T-Shirts,Mugs,Keyrings
0,5,2,10


Okay, so it's printed out a table! Great! We have the three columns (the categories), but there's still the index number (0). Let's add more data and see what happens.

Pandas requires every list of values to be the same length, so each column needs the same number of rows.
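If you're curious what happens when the lengths don't match, this little sketch (with a deliberately broken, made-up dictionary called `mismatched`) catches the error Pandas raises rather than letting it stop the notebook:

```python
import pandas as pd

mismatched = {
    'T-Shirts': [5, 12, 8],   # three values
    'Mugs': [2, 3],           # only two values
}

try:
    pd.DataFrame(mismatched)
    error_message = None
except ValueError as error:
    # Pandas raises a ValueError because the columns are different lengths
    error_message = str(error)
    print("Pandas refused to build the table:", error_message)
```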

In [3]:
orders = {
    'T-Shirts': [5, 12, 8, 1],
    'Mugs': [2, 3, 22, 7],
    'Keyrings': [10, 20, 55, 30]
}

# Insert the values from "orders" into a Pandas "DataFrame"
# Each key in the dictionary will correspond to a column in Pandas.

orders = pd.DataFrame(orders)

# And display!
orders

,T-Shirts,Mugs,Keyrings
0,5,2,10
1,12,3,20
2,8,22,55
3,1,7,30


Sweet! We've got more data. It's a bit jumbled up though, and we still see the index numbers, which tell us nothing!

In [4]:
# We can give Pandas our own index, replacing the numerical "0 - 3" row labels with something a bit more logical
# For example, months!

orders = {
    'T-Shirts': [5, 12, 8, 1],
    'Mugs': [2, 3, 22, 7],
    'Keyrings': [10, 20, 55, 30]
}


orders = pd.DataFrame(orders, index=['January', 'February', 'March', 'April'])

orders

,T-Shirts,Mugs,Keyrings
January,5,2,10
February,12,3,20
March,8,22,55
April,1,7,30


Ah! Now that makes more sense, we can see that March was their busiest month for purchases!
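Rather than eyeballing the table, we could let Pandas confirm the busiest month for us. This is just a sketch: it rebuilds the same data under a different name (`monthly_orders`) so it can run on its own.

```python
import pandas as pd

monthly_orders = pd.DataFrame(
    {
        'T-Shirts': [5, 12, 8, 1],
        'Mugs': [2, 3, 22, 7],
        'Keyrings': [10, 20, 55, 30],
    },
    index=['January', 'February', 'March', 'April'],
)

# axis=1 adds across the columns, giving one total per row (month)
totals = monthly_orders.sum(axis=1)
print(totals)

# idxmax returns the index label of the largest value
print(totals.idxmax())   # March, with 85 items in total
```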

But we are only showing 4 months here. What if we had a dataset that had all 12 months? We can use Pandas to single out specific index labels.

In other cases, we could use this to locate orders by a specific customer, to give one of many examples.

In [5]:
orders.loc['April']

T-Shirts     1
Mugs         7
Keyrings    30
Name: April, dtype: int64
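`.loc` can do a bit more than fetch one row. As a sketch (again rebuilding the table under its own name, `sales`, so it runs standalone), here are a few variations you might find handy:

```python
import pandas as pd

sales = pd.DataFrame(
    {
        'T-Shirts': [5, 12, 8, 1],
        'Mugs': [2, 3, 22, 7],
        'Keyrings': [10, 20, 55, 30],
    },
    index=['January', 'February', 'March', 'April'],
)

# A row label plus a column label gives a single cell
print(sales.loc['April', 'Mugs'])          # 7

# A list of labels gives several rows back as a DataFrame
print(sales.loc[['January', 'March']])

# .iloc works by position instead of label; -1 is the last row (April)
print(sales.iloc[-1])
```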

## Let's start reading from CSV Files
I've made a few little spreadsheets with random values that we can have a play with. Again, because these are multi-dimensional, we are still using DataFrames.

CSV files (and similar text formats) contain delimiters: characters that separate one value from the next. If Pandas is told the wrong delimiter, it can't split the values apart, and each row gets read as a single merged column.

My dataset uses the ',' delimiter, so let's specify that when loading the CSV.
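To see why the delimiter matters, here's a small sketch using a made-up, semicolon-separated snippet held in a string (so no extra file is needed). A comma is the default for `read_csv`, and `sep` is the more commonly seen name for the same option as `delimiter`.

```python
import io

import pandas as pd

# Hypothetical data that uses ';' instead of ','
semicolon_text = "Month;Mugs\nJanuary;43\nFebruary;246\n"

# Default delimiter (a comma): nothing gets split, so we get one merged column
wrong = pd.read_csv(io.StringIO(semicolon_text))

# Correct delimiter: the values land in separate columns
right = pd.read_csv(io.StringIO(semicolon_text), sep=';')

print(wrong.shape)   # (2, 1)
print(right.shape)   # (2, 2)
print(right)
```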

In [6]:
# Load the CSV file into a DataFrame
# Pandas has a function made specifically for reading CSV files! How about that
# Then we specify our dataset's delimiter

csvfile = pd.read_csv('simple_orders.csv', delimiter = ',')

csvfile


,Month,T-Shirts,Mugs,Keyrings
0,January,22,43,22
1,Feburary,10,246,231
2,March,14,14,34
3,April,2,10,12
4,May,4,2,4
5,June,87,63,7
6,July,43,432,2
7,August,14,123,4
8,September,16,67,8
9,October,29,100,3


We have literally just output an entire spreadsheet in essentially two lines of code (ignoring the import at the start of the notebook).

Still notice the index numbers though? I certainly didn't put them in the spreadsheet, so that's Pandas doing that - as we've seen before.

In [7]:
csvfile = pd.read_csv('simple_orders.csv', delimiter = ',', index_col=0) 

csvfile

Month,T-Shirts,Mugs,Keyrings
January,22,43,22
Feburary,10,246,231
March,14,14,34
April,2,10,12
May,4,2,4
June,87,63,7
July,43,432,2
August,14,123,4
September,16,67,8
October,29,100,3


By telling Pandas which column to use as the index, we stop it from generating its own row numbers and adding them as an extra column. Sweet!

You don't need to specify the index column by its position, you can instead just specify its name, like so.

In [8]:
csvfile = pd.read_csv('simple_orders.csv', delimiter = ',', index_col="Month") 

csvfile

Month,T-Shirts,Mugs,Keyrings
January,22,43,22
Feburary,10,246,231
March,14,14,34
April,2,10,12
May,4,2,4
June,87,63,7
July,43,432,2
August,14,123,4
September,16,67,8
October,29,100,3


We get the exact same result, just by specifying the column name instead of its index number!
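If you'd like to check that for yourself, here is a sketch that compares the two approaches on a tiny made-up CSV held in a string. It also shows a third option, `set_index`, for when the data has already been loaded.

```python
import io

import pandas as pd

# A two-row stand-in for the real file, for illustration only
csv_text = "Month,T-Shirts,Mugs,Keyrings\nJanuary,22,43,22\nMarch,14,14,34\n"

by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Month")

# .equals compares the contents of two DataFrames
print(by_position.equals(by_name))

# The index can also be chosen after loading
after_loading = pd.read_csv(io.StringIO(csv_text)).set_index("Month")
print(after_loading.index.tolist())
```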

## But! This is just a small dataset...
Very often, datasets such as CSV files are not small in the slightest. Our current method outputs the entire contents. Imagine trying to display 4000 rows and 200 columns! Not very effective...

Instead, we can tell Pandas to display just a certain number of rows.

In our example we have 12 rows, so let's tell Pandas to only display 3.

**Even though only 3 rows are being rendered, all 12 rows are still loaded.**

In [9]:
# You can replace "3" with whatever number you like. Well... up to 12 anyway.

csvfile.head(3)

Month,T-Shirts,Mugs,Keyrings
January,22,43,22
Feburary,10,246,231
March,14,14,34


In [10]:
# You can also display just a certain number of the bottom rows if you wish! This is fairly common when handling datasets,
# so that we can get a good picture of the makeup of the dataset without trying to render it all.

# Again, you can replace "3" with whatever number you like.

csvfile.tail(3)

Month,T-Shirts,Mugs,Keyrings
October,29,100,3
November,3,12,1
December,5,23,7


We've managed to tell Pandas to display only the first three months, and then only the last three, using `.head(#)` and `.tail(#)`. Neat! These sorts of things are super, super common when handling data, so we've nailed a core principle.
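Here's a sketch of the same idea on something bigger. The `big` table below is invented (100 rows of made-up order numbers), just to show that `.head()` with no number defaults to 5 rows and that nothing is thrown away.

```python
import pandas as pd

# Hypothetical larger dataset: 100 rows of made-up numbers
big = pd.DataFrame({
    'order_id': range(1, 101),
    'quantity': range(100, 0, -1),
})

print(big.head())        # no argument: the first 5 rows
print(big.tail(2))       # the last 2 rows

print(len(big.head()))   # 5
print(len(big))          # 100, so the full dataset is still loaded
```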


### Before we start the pretty stuff:
There's one final little snippet that is, again, super useful and important in data handling: the `.shape` attribute.

This is often used when cleaning up datasets (they are often very wishy-washy compared to what we need), so `.shape` is a great way to see exactly what our filtering has done.

In [11]:
# Simply put, rather than rendering any data, let's just get a numerical summary of what's in there!

csvfile.shape

(12, 3)

Pandas' `.shape` looks at the DataFrame and reports its size in the following format: **rows, columns** (the index, Month in our case, isn't counted as a column).

So in our case, we have **12 rows** and **3 columns** 
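To show how `.shape` helps when filtering, here's a sketch on a small made-up table (`sample`). The "more than 40 mugs" rule is an arbitrary filter chosen purely for the example.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'T-Shirts': [22, 10, 14, 2],
        'Mugs': [43, 246, 14, 10],
    },
    index=['January', 'February', 'March', 'April'],
)

print(sample.shape)        # (4, 2): 4 rows, 2 columns

# Keep only the months where more than 40 mugs were sold
busy_months = sample[sample['Mugs'] > 40]

print(busy_months.shape)   # (2, 2): the filter removed 2 rows
```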

Finally, what if we just wanted to count the number of values in a column or category, such as mugs?

In [12]:
csvfile[['Mugs']].count()

Mugs    12
dtype: int64
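One thing worth knowing is that `.count()` only counts values that are actually present. In this sketch, the `with_gap` table is made up and has one missing entry, so the count comes out lower than the number of rows.

```python
import pandas as pd

# Hypothetical column with one missing value
with_gap = pd.DataFrame({'Mugs': [43, None, 14]})

print(len(with_gap))             # 3 rows in total
print(with_gap['Mugs'].count())  # 2, because the missing value is skipped
```

That difference makes `.count()` a quick way of spotting gaps in a dataset.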

## Good job! Answer the questions in the room, and then we'll see how we can plot this data into graphs!