# SLU1 - Pandas 101: Learning notebook

In this notebook we will be covering the following:

- What is Pandas
- Series
- Dataframes
- Previewing a dataframe
- Columns
- Shape
- Reading data from disk
- Writing data to disk
- Info
- Describe

## What is pandas?

Pandas is one of the most widely used Python libraries for working with data. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.

In this notebook the most basic functionalities will be covered.

### How do I call it?

In [None]:
import pandas as pd

Notice that we import pandas as "pd". This is not required, but it's standard. 

### Pandas Data Structures

There are two main data structures in pandas:
- **Series** - A one-dimensional array of data of the same type. More documentation on Series is available on: [Pandas Series Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

![Pandas Series](assets/serie.png "Pandas Series")

- **Dataframes** - Tabular structure that may be seen as a container of series (that may have different types). Be aware that it is also possible to hold a one-dimensional array of data in a DataFrame. More documentation on Dataframes is available on: [Pandas Dataframes Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

![Pandas DataFrame](assets/dataframe.PNG "Pandas Dataframe")


---

## Series

Creating a series in pandas is really easy. We will start by creating a series of numbers and printing it to see what it looks like.

[Pandas Series Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

In [None]:
s1 = pd.Series([10, 3, 5, 1, 12])
s1

We can see the values we passed as data, each paired with an index (from 0 to the length of the data minus 1). Notice as well that the series has one and only one type of data, in this case `int64`. Pandas is quite clever at inferring what kind of data is passed to it. Additionally, the order of the data has been maintained.

Let's see what would have happened if we had passed it some floats, instead of ints: 

In [None]:
s2 = pd.Series([0.5, 0.2, 5.2, 1.6, -0.6])
s2

Ok, so now it's a `float64` series. 

Next up, the same, but this time with some strings: 

In [None]:
s3 = pd.Series(["Google", "Microsoft", "Facebook", "Apple"])
s3

Ok, this time it was considered `object`. The `object` dtype can hold references to arbitrary Python objects, which is why it is also used for mixed data, as we may see in the example below.

Fair question: what happens if you pass it a mix of stuff? 

In [None]:
s4 = pd.Series([1, 2.3, "omg a string", 2])
s4

Well, when everything is mixed, it makes it an object! 

Series have an attribute that shows us their data type. It's called `dtype` and can be used like this:

In [None]:
s1.dtype

In [None]:
s4.dtype

Note that series `s1` has the dtype `int64` and `s4` has the dtype `object`.
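
The inference rules seen so far can be summarized in a minimal sketch. The variable names below are illustrative and not part of the notebook; the point is only that ints mixed with floats are upcast to floats, while anything mixed with a string falls back to `object`.

```python
import pandas as pd

# Illustrative names, not used elsewhere in this notebook.
ints = pd.Series([10, 3, 5])                        # only Python ints
mixed_numbers = pd.Series([1, 2.5, 3])              # ints and floats together
mixed_anything = pd.Series([1, 2.3, "omg a string"])  # numbers and a string

print(ints.dtype)            # integer dtype
print(mixed_numbers.dtype)   # the ints are upcast so one float dtype fits all
print(mixed_anything.dtype)  # no single numeric type fits, so object is used
```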

### Indexing

You will have noticed that our Series so far have a bunch of numbers on the left _(0, 1, 2, 3...)_. 

Those values represent the index, which is used for (among other things) selecting. 

Even though by default the index is _0, 1, 2, 3..._ it is often useful to set a different index. 

Here is an example: 

In [None]:
s5 = pd.Series(data=["Larry", "Bill", "Mark", "Steve"], 
               index=["Google", "Microsoft", "Facebook", "Apple"])
s5

We wanted `Bill` to have the index `Microsoft`. Now we can actually treat this a bit like a dictionary: 

In [None]:
s5['Microsoft']

We can also get all the values (still a bit like a dictionary): 

In [None]:
s5.values

Or the index (like the `.keys()` of a dictionary):

In [None]:
s5.index
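
As a small sketch of how the index is used for selecting, the example below contrasts selection by label (`.loc`) with selection by position (`.iloc`). The variable names are illustrative; `.loc` and `.iloc` are not covered elsewhere in this notebook and are shown here only as the explicit forms of the dictionary-like access above.

```python
import pandas as pd

founders = pd.Series(
    data=["Larry", "Bill", "Mark", "Steve"],
    index=["Google", "Microsoft", "Facebook", "Apple"],
)

by_label = founders.loc["Microsoft"]         # select by index label
by_position = founders.iloc[1]               # select by 0-based position
has_apple = "Apple" in founders              # membership tests the index, like dict keys
several = founders.loc[["Google", "Apple"]]  # a list of labels returns a Series

print(by_label, by_position, has_apple)
print(several)
```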

Speaking of dictionaries, can I make a Pandas Series from a dictionary? 

In [None]:
my_dict = {"Google": "Larry",
           "Microsoft": "Bill",
           "Facebook": "Mark",
           "Apple": "Steve"}

s6 = pd.Series(my_dict)
s6

The Series class automatically uses the keys of the dictionary as the index of the series and the corresponding values as its data. The interesting part is that we now have some functionality that dictionaries usually don't offer, such as slicing.

In [None]:
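try:
    my_dict[-1:]  # dictionaries don't support slicing
except (TypeError, KeyError):  # the exact exception depends on the Python version
    print("Illegal operation")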

In [None]:
s5[-1:]
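
A minimal sketch of two kinds of slicing that a Series built from a dictionary supports. The names are illustrative. Note the assumption being illustrated: positional slices exclude the end position, as in plain Python, while label slices with `.loc` include both end labels.

```python
import pandas as pd

founders = pd.Series({
    "Google": "Larry",
    "Microsoft": "Bill",
    "Facebook": "Mark",
    "Apple": "Steve",
})

last_one = founders.iloc[-1:]                         # positional slice: the last element
label_range = founders.loc["Microsoft":"Facebook"]    # label slice: both ends included

print(last_one)
print(label_range)
```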

Since pandas 0.24, the documentation recommends replacing the attribute `.values` with one of the following, depending on whether you need a reference to the underlying data or a NumPy array: `.array` or `.to_numpy()`, respectively.
The reason is that `.values` sometimes returns a NumPy array and other times an ExtensionArray, whereas the newer options make explicit which one you get.

If you use `.array` on the series `s5`, you get a thin wrapper around a NumPy array, as you can see below (called `PandasArray` in older releases and `NumpyExtensionArray` since pandas 2.1; the exact class can vary with your pandas version). This attribute works for both Series and Index.

In [None]:
s5.array

In [None]:
s5.index.array

If the Series holds a different type of data, such as periods, `.array` returns a `PeriodArray`.

In [None]:
s7 = pd.Series(pd.period_range('2000', periods=4))

In [None]:
s7.array

If you need an actual numpy array, you can do:

In [None]:
s5.to_numpy()

In [None]:
s5.index.to_numpy()
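
To make the difference concrete, here is a short sketch that checks the type each option returns. The variable names are illustrative, and `pd.arrays.PeriodArray` is used only to verify the type.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([10, 3, 5])
periods = pd.Series(pd.period_range("2000", periods=4))

as_numpy = numbers.to_numpy()          # always a NumPy array
as_extension = periods.array           # the pandas-native array backing the Series
periods_as_numpy = periods.to_numpy()  # NumPy has no period type, so objects are used

print(type(as_numpy).__name__)
print(type(as_extension).__name__)
print(periods_as_numpy.dtype)
```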

#### Key points:

- Each series has only one data type (even if it is a more inclusive one, like object).
- A list of index labels may be provided (it has to have the same length as the data).
- It is possible to use dictionaries to create series.
- There are various attributes and methods to get information from a Series.

---

## DataFrames

As mentioned previously, a dataframe is a tabular structure (think "Excel sheet"). This will become clear with the following examples.

[Pandas Dataframes Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

Let's make our first DataFrame: 

In [None]:
df1 = pd.DataFrame([10,122,1])

In [None]:
df1

This first dataframe is a really simple one. We can see that in this case we have columns and indexes. This shows the tabular structure. But the next case will highlight this even further.

In [None]:
df2 = pd.DataFrame([[1,   2,   3,    7],  # ignore the weird spacing, it's just to be clear we have 3 lists of 4 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ])

In [None]:
df2

Notice that the way this dataframe is being created leads to a row of values for each list of data provided. It is also possible to provide a list of names for each of the columns...

In [None]:
df3 = pd.DataFrame([[1,   2,   3,    7], 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ],
                    columns=["col_1", "col_2", "col_3", "col_4"])   # <-- The column names! 

In [None]:
df3

...as well as for each of the rows.

In [None]:
df4 = pd.DataFrame([[1,   2,   3,    7], 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ],
                    columns=["col_1", "col_2", "col_3", "col_4"],  # <-- The column names
                    index=["row_1", "row_2", "row_3"])   # <-- The row names

In [None]:
df4
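
One consequence of building a DataFrame row by row is worth a quick sketch: each column ends up mixing an int, a float and a string, so every column falls back to the `object` dtype. The variable name `by_rows` is illustrative.

```python
import pandas as pd

by_rows = pd.DataFrame(
    [[1, 2, 3, 7],
     [4.2, 6.1, 8.9, -4.1],
     ["a", "b", "c", "z"]],
    columns=["col_1", "col_2", "col_3", "col_4"],
    index=["row_1", "row_2", "row_3"],
)

print(by_rows.shape)   # (number of rows, number of columns)
print(by_rows.dtypes)  # every column mixes types, so each one is object

# A single cell can be read by row label and column label.
cell = by_rows.loc["row_2", "col_4"]
print(cell)
```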

So far we've been creating DataFrames from lists, like so: 

In [None]:
company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]
founder_surname = ["Page", "Gates", "Zuckerberg", "Jobs"]

df5 = pd.DataFrame([company, founder_name, founder_surname])

In [None]:
df5

But we can also do something cool, which is to make a dictionary with the lists as values, where the keys will be the column names. Let's create the dictionary first, using the lists we have defined above: 

In [None]:
tech_companies_dictionary = {
    'company': ["Google", "Microsoft", "Facebook", "Apple"],
    'founder_name': ["Larry", "Bill", "Mark", "Steve"],
    'founder_surname': ["Page", "Gates", "Zuckerberg", "Jobs"],
}

This is super readable, right? 

In [None]:
tech_companies_dictionary

Now we can simply pass this to a Pandas DataFrame: 

In [None]:
df6 = pd.DataFrame(tech_companies_dictionary)

In [None]:
df6

By passing a dictionary as input, the dataframe uses each key as a column name and lays out the corresponding list down that column, instead of along a row. This is closer to how information is usually presented.
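
The two orientations can be compared side by side in a small sketch. The names `from_lists`, `from_dict` and `transposed` are illustrative; `.T` (transpose) is shown only as one way to turn the row layout into the column layout.

```python
import pandas as pd

company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]

# A list of lists: each inner list becomes a ROW.
from_lists = pd.DataFrame([company, founder_name])

# A dictionary of lists: each list becomes a COLUMN named after its key.
from_dict = pd.DataFrame({"company": company, "founder_name": founder_name})

# Transposing the row layout swaps rows and columns.
transposed = from_lists.T
transposed.columns = ["company", "founder_name"]

print(from_lists.shape, from_dict.shape, transposed.shape)
```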

----

### Putting it all together

Let's do the same thing again, using everything we've learned so far:  

In [None]:
# Let's say we have these lists somewhere on our computer: 
founder_names = ["Larry", "Bill", "Mark", "Steve", "Larry", "Reed"]
founder_surnames = ["Page", "Gates", "Zuckerberg", "Jobs", "Ellison", "Hastings"]
company = ["Google", "Microsoft", "Facebook", "Apple", "Oracle", "Netflix"]

Let's make some Series, using the company name as index: 

In [None]:
series_of_founder_names = pd.Series(data=founder_names, # <-- data 
                                    index=company)      # <-- index 

In [None]:
series_of_founder_names

Same thing, this time for surnames: 

In [None]:
series_of_founder_surnames = pd.Series(data=founder_surnames, # <-- different data
                                    index=company)        # <-- same index 

In [None]:
series_of_founder_surnames

Now with these two Series we can create a dataframe! Pandas will notice that they have the same index, and will give the DataFrame that index: 

In [None]:
df7 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames})

In [None]:
df7

By passing Series (in this case sharing the same index) as the values of a dictionary, the DataFrame uses each key as a column name and the shared index as the row names. The columns and the index (rows) are also accessible, as will be shown below.

### What if my data isn't a Pandas Series?

It will often happen that you have a list or array:

In [None]:
number_of_employees = [73992, 124000, 20658, 123000, 138000, 5400]

In [None]:
series_of_number_employees = pd.Series(data=number_of_employees) # <-- data, no index 

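Now, you may be tempted to add the raw list directly to the DataFrame, and Pandas won't stop you. (Passing `series_of_number_employees` instead would not behave as intended: Pandas aligns Series by index label, and its default index _0, 1, 2..._ shares no labels with the company names, so the columns would be padded with `NaN`.)

In [None]:
df8 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames,
                    'number_employees': number_of_employees})  # <-- plain list, matched by position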

In [None]:
df8

You should however notice that this is a **lot more dangerous**, as you are making the assumption that the rows are in the same order as the list.

In practice, you will probably end up doing this out of time constraints, or reading other people's code where lists are added directly without an index. But remember: if you have an index you are safer.
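
The following sketch illustrates why an index is safer. It uses a shortened, deliberately shuffled version of the data, and all variable names are illustrative. A Series is matched to rows by label, a plain list is matched by position, and a Series with the default index shares no labels with the others.

```python
import pandas as pd

founders = pd.Series(["Larry", "Bill", "Steve"],
                     index=["Google", "Microsoft", "Apple"])

# Same labels as above, but in a different order.
employees = pd.Series([123000, 73992, 124000],
                      index=["Apple", "Google", "Microsoft"])

# Series + Series: rows are matched by label, so the order does not matter.
aligned = pd.DataFrame({"founder": founders, "employees": employees})

# Series + list: values are matched by position, so Google silently gets Apple's number.
positional = pd.DataFrame({"founder": founders,
                           "employees": [123000, 73992, 124000]})

# Series + Series with the default 0, 1, 2 index: no labels in common.
unlabeled = pd.Series([123000, 73992, 124000])
misaligned = pd.DataFrame({"founder": founders, "employees": unlabeled})

print(aligned.loc["Google", "employees"])
print(positional.loc["Google", "employees"])
print(misaligned)
```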

-----

### Getting the index and column values 

This dataframe object contains some cool attributes, among which are the following: 

Get index, with `.index`: 

In [None]:
df8

In [None]:
df8.index

Get the columns, with `.columns`: 

In [None]:
df8.columns

Among other things, this might be used to iterate over the column titles.

In [None]:
for col in df8.columns:
    print(col)

We can also use `dtypes` to know the type of each series of the dataframe:

In [None]:
df8.dtypes

As with Series, a DataFrame also has the `.values` attribute, but the pandas documentation recommends using `.to_numpy()` instead.

Note: a DataFrame doesn't have the `.array` attribute.

In [None]:
df8.to_numpy()
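
As a compact recap of these attributes, here is a sketch on a small, illustrative DataFrame (the variable names are not from the notebook). It also shows that `.to_numpy()` on columns of different types produces a single array of `object` dtype, since a NumPy array has only one dtype.

```python
import pandas as pd

df = pd.DataFrame(
    {"founder": ["Larry", "Bill", "Steve"],
     "employees": [73992, 124000, 123000]},
    index=["Google", "Microsoft", "Apple"],
)

row_labels = df.index.tolist()       # the row names
column_names = df.columns.tolist()   # the column names
employees_dtype = df.dtypes["employees"]  # dtype of one column
matrix = df.to_numpy()               # strings and ints together: one object array

print(row_labels)
print(column_names)
print(employees_dtype)
print(matrix.shape, matrix.dtype)
```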

#### Key points:

- DataFrames may be seen as a tabular structure (named rows and columns).
- We can define the indexes and columns as we create the dataframe.
- It's possible to take advantage of dictionaries and Series to create DataFrames.

---

## Previewing a DataFrame

#### Visualizing the DataFrame or part of it

In a Jupyter notebook, evaluating a DataFrame as the last expression of a cell displays it (as seen previously).

In [None]:
df8

In the case that the dataframe has a lot of entries, it will be only partially displayed. Nonetheless, it might still be too much information being displayed at once and the methods that are going to be used below often prove to be a better alternative. Namely, it is possible to print only a certain number of entries from the top or from the bottom using `.head` and `.tail`, respectively.

In [None]:
df8.head(n=2)

In [None]:
df8.tail(n=2)
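
Since `df8` only has six rows, the effect is easier to see on a larger frame. The sketch below uses an illustrative 100-row DataFrame and also shows the default of five rows and the behaviour of a negative `n`.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

first_five = df.head()          # n defaults to 5
last_three = df.tail(3)         # n can be passed positionally
all_but_last_two = df.head(-2)  # a negative n drops rows from the other end

print(len(first_five), len(last_three), len(all_but_last_two))
print(last_three)
```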

---

## Retrieving DataFrame Information

#### Getting the relevant info

With pandas' [info](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) it is possible to obtain:
- How many entries it has.
- The total number of columns.
- The title of each column.
- The number of non-null entries in each column (missing values are not counted).
- The type of data of the entries of a given column.

In [None]:
df8.info()

For the **NUMERICAL** variables it's also possible to print some more information using [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html), namely:

- The number of non-missing values in each of those columns.
- The mean value.
- The standard deviation.
- The minimum and maximum value.
- The median and the 25th and 75th percentiles.

In [None]:
df8.describe()

Finally, [shape](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) returns a tuple with the dimensions of the dataframe (nr_rows, nr_columns).

In [None]:
df8.shape
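
A minimal sketch of how `describe` and `shape` treat missing values and non-numeric columns. The data and names are made up for illustration. The `include="all"` argument is an option of `describe` that also summarizes non-numeric columns.

```python
import pandas as pd

# Made-up data with one missing value in the numeric column.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "employees": [100.0, 200.0, None, 400.0],
})

summary = df.describe()                 # numeric columns only
everything = df.describe(include="all") # also summarizes the text column

print(df.shape)    # shape counts all rows, including the one with a missing value
print(summary)     # count ignores the missing value
print(everything)
```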

### Key points:

- It's possible to display the whole dataframe, but for large dataframes the output can be too "noisy".
- `head()` and `tail()` return the first and last n rows of the dataframe, respectively.
- `info()` reports the number of entries, the number of columns, the non-null count of each column and its data type.
- `describe()` returns basic statistical information of the numeric columns.

---

## Reading from the disk

Pandas provides functions that allow us to create dataframes from several different data formats:

- CSV
- JSON
- HTML
- ... and [many more](https://pandas.pydata.org/pandas-docs/stable/io.html)

All of this is possible by using the read_*dataFormat* functions (for example `read_csv` or `read_json`). Each one creates a dataframe to which all the previously shown techniques can be applied.

For instance, using the 2010 census profile and housing characteristics of the city of Los Angeles ([source](https://catalog.data.gov/dataset/2010-census-populations-by-zip-code)):

In [None]:
data = pd.read_csv("data/2010_Census_Populations_by_Zip_Code.csv")

What does this look like?

In [None]:
data.head(5)

Let's use `info`: 

In [None]:
data.info()

Now for a fuller description:

In [None]:
data.describe()
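
One detail that matters for zip-code data is type inference while reading. The sketch below reads from an in-memory string instead of a file so that it is self-contained. The column names and values are made up for illustration and are not taken from the census file. By default a zip code column is parsed as a number, which drops leading zeros. The `dtype` argument of `read_csv` keeps it as text.

```python
import io

import pandas as pd

# Made-up sample data, NOT the real census file.
csv_text = """zip_code,total_population,median_age
01001,16769,45.0
90001,57110,26.6
90002,51223,25.5
"""

# Default inference: zip_code becomes an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: zip_code stays a string, leading zeros preserved.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred["zip_code"].iloc[0])
print(as_text["zip_code"].iloc[0])
```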

### Key points:

- Pandas allows the creation of dataframes from several structures of data.

----

## Writing to the disk

Besides reading from the disk, Pandas allows us to also write and save our dataframe after we performed some transformations to the data.

The same way we can read data from various file formats, we can also write data to various file formats (CSV, JSON, HTML, ...).

All of this is possible by using the to_*dataFormat* methods (for example `to_csv`), giving as an argument the path where you want to save the file:

In [None]:
data.to_csv("data/new_csv.csv")
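
By default `to_csv` also writes the index as the first column. The round trip below is a sketch of what that means when the file is read back. It writes to a temporary directory, and the data and file names are illustrative. Passing `index=False` skips the index, which is often what you want when the index is just _0, 1, 2..._.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"company": ["Google", "Apple"],
                   "employees": [73992, 123000]})

with tempfile.TemporaryDirectory() as tmp:
    with_index_path = os.path.join(tmp, "with_index.csv")
    without_index_path = os.path.join(tmp, "without_index.csv")

    df.to_csv(with_index_path)                  # index written as an unnamed first column
    df.to_csv(without_index_path, index=False)  # index left out

    reread_with = pd.read_csv(with_index_path)
    reread_without = pd.read_csv(without_index_path)

print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
```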

---

## To learn more (optional)

- [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)