In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Intro_to_tables.ipynb")

# Rectangular Data

Data is produced from various sources in forms like text, images, videos, and sensor readings. A significant portion of this data is unstructured, meaning it does not adhere to a specific data model or format. For instance, unstructured data includes textual content like books or articles, which can be organized in different ways (e.g., chapters, headings) but lack a consistent structure across all texts. Other examples of unstructured data include social media posts, emails, and multimedia files such as photos and videos.

In contrast, some data is organized in a well-defined format with rows and columns, similar to what you find in an Excel spreadsheet. This type of data is referred to as **rectangular data**.

Rectangular data refers to a two-dimensional matrix where rows represent individual records (data points) and columns represent features (attributes or variables). In Python, the term **dataframe** specifically denotes this type of rectangular data, especially when using the pandas library. In this course, we will primarily concentrate on working with rectangular data.

### Reading Rectangular Data in Pandas

To handle structured data in Python, we can use the pandas library. For CSV files, use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. For rectangular data in delimited text files, `read_csv` also works once you specify the delimiter (the closely related `read_table` function defaults to tab-separated data), and for Excel files, use the `read_excel` function.

Let's see these in action. You have been given the `Nascar Driver Results` dataset, which contains driver results for all NASCAR races between 1975 and 2003, inclusive. The dataset covers all participants in each of 898 races and includes their start and finish positions, prize winnings, car make, and laps completed. For more information about the dataset, read [here](https://jse.amstat.org/datasets/nascard.txt).

There are 34,884 observations and 10 features in total. The features are described below:

1. **Series Race**: The race id (int)
2. **Year**: the year the race occurred (int)
3. **Race/Year**: the race id within that year (int)
4. **Fin Pos**: the finishing position, 1 represents the winner (int)
5. **Start Pos**: the starting position (int)
6. **Laps Completed**: the number of laps completed (int)
7. **Winnings**: the prize winnings, in dollars (int)
8. **Total Cars**: the total number of cars in the race (int)
9. **Car Make**: the make of the car (str)
10. **Driver**: the name of the driver (str)

In [None]:
import pandas as pd

In [None]:
nascar = pd.read_csv('data/nascar.csv')

You'll see that the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method has many parameters. To understand the purpose of each one, you can consult the documentation. For now, let's concentrate on the first parameter, `filepath_or_buffer`.

<img src="pics/read_csv.png" alt="read_csv_filepath_or_buffer" width="500"/>

Essentially, this means that the first parameter of this method should be a string representing the URL, address, or file path from which to read the CSV file. In our case, since the file is located in the data directory, we specify the file path as `data/nascar.csv`.
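To make the `filepath_or_buffer` idea concrete, here is a minimal sketch that reads CSV text from an in-memory buffer instead of a file on disk. The CSV text and the names `csv_text` and `toy` are invented for this illustration and are not part of the NASCAR dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this illustration
csv_text = "Driver,Winnings\nA. Smith,1000\nB. Jones,2500\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

The same call with a path string, such as `'data/nascar.csv'`, reads from disk instead of from the buffer.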

Let's check the data type of the `nascar` variable. Remember, a variable's data type is determined by the value it holds. In pandas, any tabular or rectangular data is represented as a **DataFrame**.

In [None]:
type(nascar)

Let's now display the contents of the `nascar` dataframe. 

A pandas DataFrame has a [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) method, which by default returns the first 5 rows of the dataframe.

In [None]:
nascar.head()

### Question 1
Write Python code to display the first 10 rows of the `nascar` dataframe.


In [None]:
...

### Question 2
Write Python code to display the last 3 rows of the `nascar` dataframe.

Hint: Read about the dataframe [`tail`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) method.


In [None]:
...

### Question 3

You have a `cancer.dat` file located in the data directory. Write Python code to correctly load this file into a DataFrame so that the resulting table appears as shown below:

<img src="pics/cancer.png" width=200 />

Do not make any change to the `cancer.dat` file. Read [here](https://grodri.github.io/glms/datasets/#cancer) for more detail about the dataset. 

Hint: You can use the `read_csv` method to read `.dat` files. Also check the **delimiter** parameter of the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.

In [None]:
cancer = ...
cancer.head()

In [None]:
grader.check("q3")

### Question 4

You have a [`copen.dat`](https://grodri.github.io/glms/datasets/#copen) file located in the data directory.

Kevin, a student in the Data Analytics program, tries to read the data file as follows:

In [None]:
copen = pd.read_csv('data/copen.dat', delimiter=';')
copen.head()

He was puzzled as to why the first row of the data was being interpreted as column names. Upon manually inspecting the `copen.dat` file, he discovered that it lacks column names. Assist Kevin by assigning the column names `housing, influence, contact, satisfaction, n` to the DataFrame.

Do NOT alter the `copen.dat` file.

Hint: Check the **names** parameter in the `read_csv` method.

The first 5 rows of your DataFrame should appear as follows:

<img src="pics/names.png" width=200 />


In [None]:
copen = ...
copen.head()

In [None]:
grader.check("q4")

### Question 5

You are given a `brazil.dat` file located in the data directory. Kevin tries to read this file; however, he gets the following table.


In [None]:
brazil = pd.read_csv('data/brazil.dat')
brazil.head()

After examining the `brazil.dat` file, Kevin noticed that the first two lines contain descriptions about the dataset, which he does not want to include in the DataFrame. Help Kevin read the data file correctly, ensuring that these initial lines are skipped. Do NOT modify the `brazil.dat` file. The first 5 rows of the resulting DataFrame should look like this:

<img src="pics/header.png" width=200 />

Hint: Look into the **header** parameter of the `read_csv` method.

In [None]:
brazil = ...
brazil.head()

In [None]:
grader.check("q5")

### Question 6

Kevin is only interested in reading the `Age-group` and `Frequency` columns into the DataFrame. Help him achieve this. The first five rows of the resulting DataFrame should appear as follows:

<img src='pics/usecols.png' width=200 />

Hint: Look into the **usecols** parameter of the `read_csv` method.

In [None]:
brazil = ...
brazil.head()

In [None]:
grader.check("q6")

### Determining the number of rows and columns of a dataframe

The `shape` property of a dataframe returns a tuple containing the number of rows and the number of columns, in that order.

In [None]:
nascar.shape # Do NOT use rounded brackets () after shape, as it is a property and not a method/function

In [None]:
nrows = nascar.shape[0]
ncols = nascar.shape[1]
print(f"The number of rows: {nrows}, The number of columns: {ncols}")

### Question 7

Write Python code to determine the number of rows and columns in the `cancer` dataframe (defined in Question 3). Store your answers in the variables `nrows` and `ncols` for the number of rows and the number of columns, respectively.

In [None]:
nrows = ...
ncols = ...
print(f"The number of rows: {nrows}, The number of columns: {ncols}")

In [None]:
grader.check("q7")

### Indexing in DataFrame

A dataframe maintains two kinds of indices: a numbered index and a named index.
Numbered indices are automatically assigned to each row and each column of a dataframe. When you printed the first few rows of the dataframe using the `head` method, you may have noticed that pandas automatically added a number, starting from 0, to each row. These numbers are the numbered index for the rows. Similarly, pandas maintains a numbered index for each column. Both indices start at 0.

In addition, a dataframe maintains a named index; that is, you can assign a name (label) to each row and each column. For example, the first row can be assigned the name "xxx" and the second row the name "yyy". Column names are simply the named index of the columns. Therefore, in the `nascar` dataframe, the first row has numbered index 0, and the first column has numbered index 0 and named index `Race`.
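The two kinds of index can be inspected directly. The sketch below uses a small table invented for this illustration (`pets` is a hypothetical name; building dataframes by hand is covered later in this notebook) and reads the row labels, the column labels, and the numbered position of one column.

```python
import pandas as pd

# Toy table invented for this illustration
pets = pd.DataFrame(
    {"Name": ["Rex", "Tom"], "Age": [3, 5]},
    index=["dog", "cat"],  # named index for the rows
)

row_labels = list(pets.index)      # named index of the rows
col_labels = list(pets.columns)    # named index of the columns
age_position = pets.columns.get_loc("Age")  # numbered index of the 'Age' column

print(row_labels, col_labels, age_position)
```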

### Accessing items of a dataframe

You can access items of a dataframe using the `loc` and `iloc` properties. `loc` accesses items by named index (labels), whereas `iloc` accesses items by numbered index (integer positions).
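The difference is easiest to see when the row labels are themselves integers that do not match the row positions. The following is a minimal sketch on an invented table (`scores` is a hypothetical name), where the labels 10, 20, 30 deliberately differ from the positions 0, 1, 2.

```python
import pandas as pd

# Toy table: row labels (10, 20, 30) differ from row positions (0, 1, 2)
scores = pd.DataFrame({"Score": [88, 92, 75]}, index=[10, 20, 30])

by_label = scores.loc[20, "Score"]   # row labelled 20, column labelled 'Score'
by_position = scores.iloc[0, 0]      # first row, first column by position

print(by_label, by_position)
```

Here `scores.loc[0, "Score"]` would raise a `KeyError`, because no row carries the label 0.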

In [None]:
nascar.head()

In [None]:
race_column = nascar.loc[:, 'Race'] # Select all rows and Race column
race_column

Let's check the type of the column returned in the previous cell.

In [None]:
type(race_column)

### Pandas Series 

A pandas Series is a **one-dimensional labeled array**. It can hold any data type supported in Python and uses labels to locate each value for retrieval. These labels can be a named or a numbered index.

#### How to create a pandas Series?

In [None]:
names = pd.Series(data = ["john", "mary"]) 
names

In [None]:
type(names)

In [None]:
# You can also assign a named index to each item of a series

names = pd.Series(data=['john', 'mary'], index=['A', 'B'])
names

To access the values only from a pandas series, you can use the `values` property:

In [None]:
names.values

To access the index of a series, you can use the `index` property:

In [None]:
names.index
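Building on the `values` and `index` properties, here is a small sketch of two further points: every Series has a single `dtype`, and individual items can be retrieved by label or by position. The data and the names `counts` and `mixed` are invented for this illustration.

```python
import pandas as pd

# Toy data invented for this illustration
counts = pd.Series(data=[3, 1, 2], index=["A", "B", "C"])
mixed = pd.Series([1, "two", 3.0])  # values of different Python types

item_b = counts.loc["B"]       # retrieve by named index
item_first = counts.iloc[0]    # retrieve by numbered index

print(counts.dtype)  # an integer dtype
print(mixed.dtype)   # object, the catch-all dtype for mixed values
```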

### Question 8:

Write Python code to create a pandas series that takes three car owner names, `John, David and Kyla`, as input values, and assigns the vehicle plate numbers `BN1950, SA3593, TL5959` as the named index.


In [None]:
car_owners = ...
car_owners

In [None]:
grader.check("q8")

### How to create a dataframe? 

To create a dataframe, we can use the `pd.DataFrame` constructor.

In [None]:
population = pd.DataFrame({'Country': ['India', 'China', 'USA'], 
                           'Population': [1425423212, 1425179569, 341534046]
                          })
population

Each row and each column of a dataframe is a pandas series.

In [None]:
countries = pd.Series(['India', 'China', 'USA'])
headcounts = pd.Series([1425423212, 1425179569, 341534046])
population = pd.DataFrame({'Country': countries, 'Population': headcounts}) # Dictionary with each column added as pandas series
population
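The claim that rows and columns are both series can be checked directly. The sketch below uses an invented table (`toy` is a hypothetical name): a row comes back as a series labelled by the column names, and a column comes back as a series labelled by the row index.

```python
import pandas as pd

# Toy table invented for this illustration
toy = pd.DataFrame({"City": ["Oslo", "Lima"], "Temp": [4, 19]})

first_row = toy.iloc[0]            # a row: Series labelled by column names
temp_column = toy.loc[:, "Temp"]   # a column: Series labelled by row index

print(type(first_row).__name__, list(first_row.index))
print(type(temp_column).__name__, list(temp_column.index))
```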

### Setting index to a dataframe

You can use the `set_index` method to set an existing column of a dataframe as its index (the row labels). You can also use it to assign a new label to each row.

In [None]:
population = population.set_index('Country')
population

#### To move the index labels back into a column, use the `reset_index` method.

In [None]:
population = population.reset_index()
population

#### To set a new series as the index

In [None]:
population = population.set_index(pd.Series(["Asia", 'Asia', 'North America']))
population

#### Access a row/column of a dataframe using `iloc` property

In [None]:
population.iloc[0] # returns the first row of the dataframe

In [None]:
population.iloc[:, 0] # returns the first column of the dataframe

#### Access cell(s) of a dataframe using `iloc` property

In [None]:
population.iloc[2, 1] # returns the value in the third row and second column of the dataframe

In [None]:
population.iloc[:2, 1] # you can do slicing as well

#### Access a row/column of a dataframe using `loc` property

In [None]:
population.loc['Asia']

In [None]:
population.loc[:, 'Country']

In [None]:
population.loc[:, ['Country', 'Population']]

#### Access cell(s) of a dataframe using `loc` property

In [None]:
population.loc['Asia', 'Population']
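Note that what `loc` returns depends on how many rows carry the requested label. The following sketch, on a table invented for this illustration (`toy` is a hypothetical name), shows that a repeated label yields a series while a unique label yields a single value.

```python
import pandas as pd

# Toy table invented for this illustration; the label 'x' appears twice
toy = pd.DataFrame({"Value": [1, 2, 3]}, index=["x", "x", "y"])

repeated = toy.loc["x", "Value"]  # two matching rows -> a Series
unique = toy.loc["y", "Value"]    # one matching row  -> a single value

print(type(repeated).__name__, list(repeated))
print(unique)
```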

### Question 9:

Write Python code to create the following table:

| Congress Member | Party | 
| --- | ---- |
| Daniel Webster | R | 
| Jared Huffman | D | 
| Michael McCaul | R | 
| Al Green | D | 

In [None]:
cong_members = ...
cong_members

In [None]:
grader.check("q9")

### Question 10

Write Python code to extract the second row of the `cong_members` dataframe. Your answer should be of type pandas series.


In [None]:
second_row = ...
second_row

In [None]:
grader.check("q10")

### Question 11

Write Python code to add the state names `Florida, California, Texas, Texas`, in that order, as named labels for the rows of the `cong_members` dataframe.


In [None]:
states = ['Florida', 'California', 'Texas', 'Texas']
cong_members = ...
cong_members

In [None]:
grader.check("q11")

### Question 12

Write Python code to access all the Congress members' names using the `iloc` property. Your answer should be a pandas series.


In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q12")

### Question 13

Write Python code to access the name of the Congress member who is from Florida using the `iloc` property. Your answer should be a string.


In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q13")

### Question 14

Write Python code to access the name of the Congress member from California using the `loc` property. Your answer should be a string.


In [None]:
cong_members

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q14")

### Question 15

Write Python code to access the names of all the Congress members from Texas using the `iloc` property. Your answer should be a pandas series.


In [None]:
cong_members

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q15")

### Question 16

Write Python code to access the names of all the Congress members from Texas using the `loc` property. Your answer should be a pandas series.



In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q16")

### Drop columns of a dataframe

You can also **drop** columns from a table if they are not important for data exploration. Pandas provides the `drop` method for this.

In [None]:
nascar.head()

In [None]:
nascar.drop('Race/Year', axis='columns').head()

Now that we have dropped the `Race/Year` column from the table, let's print the table again to make sure this column no longer exists in the dataframe.

In [None]:
nascar.head()

Oops!!! What happened? Why is the `Race/Year` column still present in the dataframe? 

Most pandas methods, including `drop`, return a modified copy and leave the original dataframe unchanged. If you want to keep the modification, you need to reassign the result to the original variable, like the following:

In [None]:
nascar = nascar.drop('Race/Year', axis=1) 
nascar.head()

Alternatively, you may use the `inplace=True` argument, which modifies the original dataframe directly.

In [None]:
nascar.drop('Total Cars', axis=1, inplace=True)
nascar.head()

Now, re-run the previous cell one more time. Why do you get an error when you try to remove the `Total Cars` column again? How can you fix this problem? In other words, which cell do you need to run again so that you can drop the column again?

Ans: `Total Cars` was already deleted when you ran the cell the first time. When you re-run the cell, pandas raises a `KeyError` because this column no longer exists. To fix this, read the NASCAR data into the `nascar` dataframe again, and then re-run the cell that drops `Total Cars`. This is equivalent to running the notebook from the beginning.
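The three behaviours discussed above (a plain `drop` returns a copy, `inplace=True` modifies the original, and dropping a missing column fails) can be reproduced on a small table. This is a minimal sketch; the table `toy` and its column names are invented for this illustration.

```python
import pandas as pd

# Toy table invented for this illustration
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

dropped = toy.drop("a", axis="columns")      # returns a new DataFrame
cols_after_plain_drop = list(toy.columns)    # toy itself is unchanged

result = toy.drop("b", axis="columns", inplace=True)  # modifies toy, returns None
cols_after_inplace = list(toy.columns)

# Dropping 'b' a second time fails because the column is already gone
try:
    toy.drop("b", axis="columns", inplace=True)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

print(cols_after_plain_drop, cols_after_inplace, second_drop_failed)
```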

### Question 17

Write Python code to drop the columns `Laps Completed` and `Car Make` from the `nascar` dataframe. Your dataframe should have only the columns `Race`, `Year`, `Race/Year`, `Fin Pos`, `Start Pos`, `Winnings`, `Total Cars`, and `Driver`.


In [None]:
nascar = pd.read_csv("data/nascar.csv")
nascar.head()

In [None]:
nascar = ...
nascar.head()

In [None]:
grader.check("q17")

### Sorting rows of a dataframe

The `sort_values` method creates a new table by arranging the rows of the original table in order of the values in the specified column (ascending by default).

Let's sort the nascar dataset in ascending order of `Start Pos`.

In [None]:
nascar.columns

In [None]:
nascar_sorted = nascar.sort_values('Start Pos', axis=0, ascending=True)
nascar_sorted.head()
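Two details of sorting are worth seeing on a small table: the original dataframe is left unchanged, and each row keeps its original row label after sorting, so position and label no longer agree. The sketch below uses an invented table (`toy` and `by_score` are hypothetical names).

```python
import pandas as pd

# Toy table invented for this illustration
toy = pd.DataFrame({"Name": ["a", "b", "c"], "Score": [30, 10, 20]})

by_score = toy.sort_values("Score")  # ascending by default

sorted_labels = list(by_score.index)          # row labels travel with the rows
original_scores = list(toy["Score"])          # toy is not modified
first_by_position = by_score.iloc[0]["Name"]  # first row in sorted order

print(sorted_labels, original_scores, first_by_position)
```

Because the labels are preserved, `iloc` is the reliable way to pick rows by their rank in the sorted table.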

### Question 18

Write Python code to sort the `nascar` dataframe by `Winnings` in ascending order.


In [None]:
nascar_sorted = ...
nascar_sorted.head()

In [None]:
grader.check("q18")

### Question 19

Write Python code to find the name of the driver with the highest winnings. Your answer should be a string. Assume no two drivers have the same highest winnings.


In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q19")

### Question 20

Write Python code to find the driver name (`Driver`), race number (`Race`), and year (`Year`) with the second-highest winnings. Your answer should be a pandas series. Assume no two drivers have the same highest winnings.

In [None]:
your_ans = ...
your_ans

In [None]:
grader.check("q20")