
# <center>  Introduction to pandas </center>
<div>
<img src="https://pandas.pydata.org/static/img/pandas.svg" width="600"/>
</div>

Previously, we learned about some useful data structures for storing and organizing data, including lists, dictionaries, tuples, and arrays. In this lecture, we will learn about the **pandas** library, some of its features, and the new data structures that it provides.

Pandas is a popular Python library used by many data scientists. It extends Python with functionality for storing, organizing, and analyzing data. The additional data structures that come with the pandas library are ***series*** and ***dataframes***. If you have experience working with spreadsheets in Microsoft Excel, pandas series and dataframes will look familiar. Essentially, series and dataframes allow data to be stored in a tabular format.

For more information on pandas and documentation on functionalities of the pandas library, refer to the <a href="https://pandas.pydata.org/docs/index.html">official pandas webpage</a>. The <a href="https://pandas.pydata.org/docs/reference/index.html">API reference page</a> is an extensive resource for many pandas functions and methods.

# Series

A pandas series can be thought of as a 1D array or a single column in a table.

To make a series, we must first import the pandas library. By convention, pandas is imported as `pd`. We can then make a series by passing a list or array to the `pd.Series()` function, as shown below:

In [None]:
import pandas as pd

In [None]:
# This is a list of lists: each element of the outer list is itself a list that holds multiple data types (strings and an integer).
ancestors_info = [['Malcolm X','Omaha, Nebraska',1925],['Martin Luther King Jr.','Atlanta, Georgia',1929], ['Nana Yaa Asantewaa','Besease, Ghana',1840],['Cater G Woodson','New Canton, VA',1875]]

# Now we can use the pandas module to create a dataframe from the list above. It's important to make sure that the column names describe the data correctly.
ancestors_df = pd.DataFrame(ancestors_info,columns=['Name','Place of Birth','Year of Birth'])
ancestors_df.head()


In [None]:
print(ancestors_df)

                     Name    Place of Birth  Year of Birth
0               Malcolm X   Omaha, Nebraska           1925
1  Martin Luther King Jr.  Atlanta, Georgia           1929
2      Nana Yaa Asantewaa    Besease, Ghana           1840
3         Cater G Woodson    New Canton, VA           1875


In [None]:
groceries = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'])

groceries

0     Doritos
1     Bananas
2    Broccoli
3     Chicken
dtype: object

Calling `groceries` shows us a series of four items that is indexed from 0 to 3 (inclusive). To name the indices, we can pass another list or array into the `.set_axis()` method.

In [None]:
groceries = groceries.set_axis(['Snack', 'Fruit', 'Vegetable', 'Meat'])

groceries

Snack         Doritos
Fruit         Bananas
Vegetable    Broccoli
Meat          Chicken
dtype: object

When creating a series, the index can also be set. To achieve the same outcome as above, we can use the `index` parameter in the `pd.Series()` function:

In [None]:
groceries2 = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
         index = ['Snack', 'Fruit', 'Vegetable', 'Meat']
                       )

groceries2

Snack         Doritos
Fruit         Bananas
Vegetable    Broccoli
Meat          Chicken
dtype: object

Series can be indexed by position (numerically) similar to indexing an array. They can also be indexed by value or name. Below we index a single item and a range of items numerically and by name:

In [None]:
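print(groceries2.iloc[2])   # Access by position; plain groceries2[2] is deprecated for label-indexed series in recent pandas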
print(groceries2["Fruit"])

Broccoli
Bananas


In [None]:
print(groceries2["Fruit":"Meat"])
print('\n')                          # Prints a new line
print(groceries2[1:4])

Fruit         Bananas
Vegetable    Broccoli
Meat          Chicken
dtype: object
Fruit         Bananas
Vegetable    Broccoli
Meat          Chicken
dtype: object


Notice that when slicing by a defined value or name, the end of the slice will be **included**. When slicing by an index position, the end of the slice will be **excluded**.
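
The two slicing behaviours can be made explicit with the `.loc` (label-based) and `.iloc` (position-based) accessors. The minimal sketch below rebuilds the grocery series under the hypothetical name `basket` so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the grocery series used above
basket = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
                   index=['Snack', 'Fruit', 'Vegetable', 'Meat'])

# Label-based slice: the end label 'Vegetable' is included
by_label = basket.loc['Fruit':'Vegetable']

# Position-based slice: the end position 3 is excluded
by_position = basket.iloc[1:3]

print(by_label.tolist())     # ['Bananas', 'Broccoli']
print(by_position.tolist())  # ['Bananas', 'Broccoli']
```

Both slices return the same two items, even though the label slice names its last element and the position slice stops one short of its end value.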

# Dataframes

Dataframes are essentially tables that consist of multiple series. Below, we see the multiple ways a dataframe can be made.
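
As a quick check of the idea that a dataframe is made up of series, the sketch below selects a single column from a small dataframe and confirms that the result is a series. The dataframe `snacks` and its contents are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical two-column table used only for illustration
snacks = pd.DataFrame({'Item': ['Doritos', 'Bananas'],
                       'Quantity': [2, 5]})

# Selecting one column by name returns a pandas Series
quantity_column = snacks['Quantity']

print(type(quantity_column).__name__)  # Series
print(quantity_column.tolist())        # [2, 5]
```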


### Making a dataframe using a dictionary
To construct a dataframe using a dictionary, you can use the `pd.DataFrame()` function. Passing a dictionary into this function creates a dataframe along the column axis: the keys of the dictionary become the column titles, while the values of each key become the rows of each column. The `index` parameter can also be passed into the function, but it must be defined outside of the dictionary as a list:

In [None]:
ancestors_df

                     Name    Place of Birth  Year of Birth
0               Malcolm X   Omaha, Nebraska           1925
1  Martin Luther King Jr.  Atlanta, Georgia           1929
2      Nana Yaa Asantewaa    Besease, Ghana           1840
3         Cater G Woodson    New Canton, VA           1875


In [None]:
grocery_df1 = pd.DataFrame(
    {"Item": ['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
     "Unit Price": [3.99,0.50,2.00, 5.00],
     "Quantity": [2,5,1,3]},
    index = ['Goodies', 'Fruit', 'Vegetable', 'Meat']
)

print(grocery_df1)
grocery_df1

               Item  Unit Price  Quantity
Goodies     Doritos        3.99         2
Fruit       Bananas        0.50         5
Vegetable  Broccoli        2.00         1
Meat        Chicken        5.00         3


               Item  Unit Price  Quantity
Goodies     Doritos        3.99         2
Fruit       Bananas        0.50         5
Vegetable  Broccoli        2.00         1
Meat        Chicken        5.00         3


### Making a dataframe using lists

Another way to construct a dataframe is by using lists. When a list of lists is passed into the `pd.DataFrame()` function, the dataframe is constructed along the row axis: each inner list becomes a single row in the dataframe. Using this method, columns can be named by passing a list into the `columns` parameter:

In [None]:
grocery_df2 = pd.DataFrame(
    [['Doritos', 3.99, 2],
     ['Bananas', 0.50, 5],
     ['Broccoli', 2.00, 1],
     ['Chicken', 5.00, 3]],
    index = ['Snack', 'Fruit', 'Vegetable', 'Meat'],
    columns = ["Item","Unit Price","Quantity"]
    )

grocery_df2

               Item  Unit Price  Quantity
Snack       Doritos        3.99         2
Fruit       Bananas        0.50         5
Vegetable  Broccoli        2.00         1
Meat        Chicken        5.00         3


### Making a dataframe using pandas series

Finally, multiple series can be used to construct a dataframe by using the `pd.concat()` function. Using this function, one can pass a list of series and specify the method of concatenation/joining through the `axis` parameter. Concatenation while `axis = 0` means that the series will be joined as additional rows; concatenation while `axis = 1` means that the series will be joined together as columns. The `keys` parameter sets the column titles and should be defined using a list.

Once the series have been concatenated into a dataframe, the indexes of the dataframe will start at 0 by default. The `.set_axis()` method can be used on the dataframe to define the index values/titles:

In [None]:
items = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'])
unit_price = pd.Series([3.99,0.50,2.00, 5.00])
quantity = pd.Series([2,5,1,3])
indices = pd.Series(['Snack', 'Fruit', 'Vegetable', 'Meat'])

grocery_df3 = pd.concat([items, unit_price, quantity], axis = 1, keys = ['Items', 'Unit Prices', 'Quantity'])


grocery_df3 = grocery_df3.set_axis(indices)

grocery_df3

              Items  Unit Prices  Quantity
Snack       Doritos         3.99         2
Fruit       Bananas         0.50         5
Vegetable  Broccoli         2.00         1
Meat        Chicken         5.00         3
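
To contrast the two values of the `axis` parameter, the sketch below concatenates the same pair of series both ways. The series `fruit` and `vegetables` are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical series used only for illustration
fruit = pd.Series(['Bananas', 'Apples'])
vegetables = pd.Series(['Broccoli', 'Carrots'])

# axis=0 stacks the series end to end, giving one longer series
stacked = pd.concat([fruit, vegetables], axis=0)

# axis=1 places the series side by side as columns of a dataframe
side_by_side = pd.concat([fruit, vegetables], axis=1, keys=['Fruit', 'Vegetable'])

print(stacked.shape)       # (4,)
print(side_by_side.shape)  # (2, 2)
```

With `axis = 0` the result is still a series with four entries, while `axis = 1` produces a dataframe with two rows and two named columns.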


### Loading Data

Python can read several types of files. Below are some useful functions to load data files:

- `pd.read_csv(file)` : Loads comma separated values files (.csv files). Requires `pandas` to be imported first.

- `pd.read_excel(file)` : Loads Microsoft Excel files (.xlsx files). Requires `pandas` to be imported first.

- `open(file, mode)` : Loads text files (.txt files); The `mode` parameter is optional and determines how the file is opened. When `mode` is not specified, the default argument `'r'` is passed, which reads the file.

We will use `pd.read_csv()` to read data stored in comma separated values (.csv) files in this lecture.
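
As a minimal sketch of how `pd.read_csv()` behaves, the example below reads comma separated text from memory instead of from a file, so it runs without any upload. The CSV content and the name `people` are hypothetical and used only for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so the example needs no file on disk
csv_text = "Name,Age\nAda,36\nGrace,45\n"

# pd.read_csv accepts a file path or a file-like object such as StringIO
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)          # (2, 2)
print(list(people.columns))  # ['Name', 'Age']
```

The first line of the text becomes the column names, and each remaining line becomes one row of the dataframe.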

When loading data into Python, we can store the contents of a csv file as a dataframe for downstream processing and analysis. Below, we load a file called `active_players.csv` using `pd.read_csv()` and save it as a dataframe called `players`:


In Google Colab:
1. Upload the file.
2. Use the file path (`'/content/active_players.csv'`).

In [None]:
players = pd.read_csv("/content/active_players.csv")

# Keep a second reference to the data as loaded, then display the dataframe
playersv2 = players
players

,Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
0,Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.50,185,,
1,Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.60,223,California,26758928.0
2,Kris Dunn,Boston Celtics,PG,27,"6' 3""",6.30,205,Providence,5005350.0
3,Carsen Edwards,Boston Celtics,PG,23,"5' 11""",5.11,200,Purdue,1782621.0
4,Tacko Fall,Boston Celtics,C,25,"7' 5""",7.50,311,UCF,
...,...,...,...,...,...,...,...,...,...
553,Juwan Morgan,Utah Jazz,SF,24,"6' 7""",6.70,232,Indiana,
554,Royce O'Neale,Utah Jazz,PF,28,"6' 4""",6.40,226,Baylor,8800000.0
555,Olumiye Oni,Utah Jazz,SG,24,"6' 5""",6.50,206,Yale,1782621.0
556,Eric Paschall,Utah Jazz,F,24,"6' 6""",6.60,255,Villanova,1782621.0


Displaying `players` shows a dataframe that contains data on active NBA players, including each player's team, position, age, height, weight, college, and salary.

To set a column as the index of the dataframe, the `.set_index()` method can be used. The name of the desired column can be passed as a string into the method:


In [None]:
players = players.set_index('Name')
players

Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.50,185,,
Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.60,223,California,26758928.0
Kris Dunn,Boston Celtics,PG,27,"6' 3""",6.30,205,Providence,5005350.0
Carsen Edwards,Boston Celtics,PG,23,"5' 11""",5.11,200,Purdue,1782621.0
Tacko Fall,Boston Celtics,C,25,"7' 5""",7.50,311,UCF,
...,...,...,...,...,...,...,...,...
Juwan Morgan,Utah Jazz,SF,24,"6' 7""",6.70,232,Indiana,
Royce O'Neale,Utah Jazz,PF,28,"6' 4""",6.40,226,Baylor,8800000.0
Olumiye Oni,Utah Jazz,SG,24,"6' 5""",6.50,206,Yale,1782621.0
Eric Paschall,Utah Jazz,F,24,"6' 6""",6.60,255,Villanova,1782621.0


We can obtain the index labels by accessing the `.index` attribute of the dataframe:

In [None]:
players.index

Index(['Juhann Begarin', 'Jaylen Brown', 'Kris Dunn', 'Carsen Edwards',
       'Tacko Fall', 'Bruno Fernando', 'Al Horford', 'Enes Kanter',
       'Luke Kornet', 'Romeo Langford',
       ...
       'Rudy Gobert', 'Elijah Hughes', 'Ersan Ilyasova', 'Joe Ingles',
       'Donovan Mitchell', 'Juwan Morgan', 'Royce O'Neale', 'Olumiye Oni',
       'Eric Paschall', 'Hassan Whiteside'],
      dtype='object', name='Name', length=558)

By default, pandas displays up to 60 rows of a dataframe. If a dataframe exceeds 60 rows, a truncated version is displayed that shows the first and last five rows.

When working with larger datasets, the number of rows shown can be adjusted with the `pd.set_option()` function. This function takes two arguments: the option you want to set and the value to set it to.

By passing `'display.max_rows'` as the first argument and the number `4` as the second argument, we set the option in our environment to display a truncated version of any dataframe that exceeds 4 rows. A similar approach can be taken with columns by passing `'display.max_columns'` as the first argument. This is useful if you would like to see all the rows and columns of a dataframe, or if you want to abbreviate the dataframe after a certain number of rows and columns:

In [None]:
pd.set_option('display.max_rows', 4)
players

Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.5,185,,
Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.6,223,California,26758928.0
...,...,...,...,...,...,...,...,...
Eric Paschall,Utah Jazz,F,24,"6' 6""",6.6,255,Villanova,1782621.0
Hassan Whiteside,Utah Jazz,C,32,"7' 0""",7.0,265,Marshall,1669178.0
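
The same pattern applies to columns. The sketch below sets `'display.max_columns'`, reads the value back with `pd.get_option()`, and then restores the default with `pd.reset_option()`. The limit of 10 is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Limit the display to 10 columns; wider dataframes are shown truncated
pd.set_option('display.max_columns', 10)
current_limit = pd.get_option('display.max_columns')
print(current_limit)  # 10

# Restore the pandas default for this option
pd.reset_option('display.max_columns')
```

Resetting an option after use keeps one cell's display settings from silently affecting the rest of the notebook.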


# Exploratory methods and functions for dataframes

When loading in data as dataframes, especially large datasets, you may want to get a quick overview of the data.

Two very useful dataframe methods that can help with this are the `.head()` and `.tail()` methods.  By default, calling `.head()` or `.tail()` on a dataframe will return the first five or the last five rows of the dataframe, respectively. Passing an integer into these methods will give you an output with that number of rows:

In [None]:
players.head(15)

,Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
0,Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.5,185,,
1,Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.6,223,California,26758928.0
2,Kris Dunn,Boston Celtics,PG,27,"6' 3""",6.3,205,Providence,5005350.0
3,Carsen Edwards,Boston Celtics,PG,23,"5' 11""",5.11,200,Purdue,1782621.0
4,Tacko Fall,Boston Celtics,C,25,"7' 5""",7.5,311,UCF,
5,Bruno Fernando,Boston Celtics,F,23,"6' 9""",6.9,240,Maryland,1782621.0
6,Al Horford,Boston Celtics,C,35,"6' 9""",6.9,240,Florida,27000000.0
7,Enes Kanter,Boston Celtics,C,29,"6' 10""",6.1,250,Kentucky,1669178.0
8,Luke Kornet,Boston Celtics,C,26,"7' 2""",7.2,250,Vanderbilt,
9,Romeo Langford,Boston Celtics,SG,21,"6' 4""",6.4,216,Indiana,3804360.0


In [None]:
players.tail(5)

If a negative integer *n* is passed into the `.head()` method, all but the last *n* rows will be shown.

Likewise, if a negative integer *h* is passed into the `.tail()` method, all but the top *h* rows will be shown:

In [None]:
players.head(2)

Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
Juhann Begarin,Boston Celtics,SG,19,"6' 5""",6.5,185,,
Jaylen Brown,Boston Celtics,SG,24,"6' 6""",6.6,223,California,26758928.0


In [None]:
players.tail(-11)

Name,Team,Position,Age,Height,Height_i,Weight,College,Salary
Jabari Parker,Boston Celtics,PF,26,"6' 8""",6.8,245,Duke,2283034.0
Payton Pritchard,Boston Celtics,PG,23,"6' 1""",6.1,195,Oregon,2137440.0
Josh Richardson,Boston Celtics,SG,27,"6' 5""",6.5,200,Tennessee,11615328.0
Dennis Schroder,Boston Celtics,PG,27,"6' 3""",6.3,172,,5890000.0
Marcus Smart,Boston Celtics,PG,27,"6' 3""",6.3,220,Oklahoma State,14339285.0
...,...,...,...,...,...,...,...,...
Juwan Morgan,Utah Jazz,SF,24,"6' 7""",6.7,232,Indiana,
Royce O'Neale,Utah Jazz,PF,28,"6' 4""",6.4,226,Baylor,8800000.0
Olumiye Oni,Utah Jazz,SG,24,"6' 5""",6.5,206,Yale,1782621.0
Eric Paschall,Utah Jazz,F,24,"6' 6""",6.6,255,Villanova,1782621.0
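
The effect of negative arguments is easier to see on a small table. The sketch below uses a hypothetical five-row dataframe called `letters`, which is not part of the lecture data:

```python
import pandas as pd

# Hypothetical five-row dataframe used only for illustration
letters = pd.DataFrame({'Letter': ['a', 'b', 'c', 'd', 'e']})

# head(-2) keeps everything except the last two rows
all_but_last_two = letters.head(-2)

# tail(-2) keeps everything except the first two rows
all_but_first_two = letters.tail(-2)

print(all_but_last_two['Letter'].tolist())   # ['a', 'b', 'c']
print(all_but_first_two['Letter'].tolist())  # ['c', 'd', 'e']
```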


When working with large datasets with many variables, the `.columns`, `.shape`, and `.size` attributes and the `.info()` method can be helpful.

The `.columns` attribute returns the column names of a dataframe as an `Index` object:

In [None]:
players.columns

The `.shape` attribute returns a tuple of the number of rows and columns within the dataframe:

In [None]:
players.shape

(558, 8)


The `.size` attribute returns the number of total data points within the dataframe (i.e. the number of rows * the number of columns):

In [None]:
players.size

4464
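
The relationship between `.columns`, `.shape`, and `.size` can be checked on a small example. The dataframe `pets` below is hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical 3-row, 2-column dataframe used only for illustration
pets = pd.DataFrame({'Animal': ['Cat', 'Dog', 'Fish'],
                     'Legs': [4, 4, 0]})

# .shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = pets.shape

print(list(pets.columns))  # ['Animal', 'Legs']
print(pets.shape)          # (3, 2)
print(pets.size)           # 6
```

Here `.size` equals the number of rows multiplied by the number of columns (3 * 2 = 6).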

Lastly, the `.info()` method provides information on the dataframe, including the range of the indexes, the data type of each column, and the memory usage of the dataframe:

In [None]:
players.info()

<class 'pandas.core.frame.DataFrame'>
Index: 558 entries, Juhann Begarin to Hassan Whiteside
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Team      558 non-null    object 
 1   Position  558 non-null    object 
 2   Age       558 non-null    int64  
 3   Height    558 non-null    object 
 4   Height_i  558 non-null    float64
 5   Weight    558 non-null    int64  
 6   College   485 non-null    object 
 7   Salary    445 non-null    float64
dtypes: float64(2), int64(2), object(4)
memory usage: 39.2+ KB


# Activity


1. Load the dataframe from your group into Colab. Make the `"zip code"` column the index of the dataframe.

2. Explore the dataset. How many entries are there? How many variables are documented for each entry? What is the data type of each variable?