# Denison CS181/DA210 SW Lab #5 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [1]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

---

## Part A: Deleting Columns/Rows

We'll use the following subset of the `topnames` dataset as an example throughout.  Note that we'll use the `.copy()` method to copy our `DataFrame` before making any changes to it.

In [2]:
# Create the DataFrame from a List of Lists
topnamesLoL = [ [2018, "Male", "Liam", 19837],
                [2018, "Female", "Emma", 18688],
                [2017, "Male", "Liam", 18798],
                [2017, "Female", "Emma", 19800],
                [2016, "Male", "Noah", 19117],
                [2016, "Female", "Emma", 19496] ]
topnamesColumns = ["year", "sex", "name", "count"]

topnames0 = pd.DataFrame(topnamesLoL, columns=topnamesColumns)

# View the DataFrame before making any changes
topnames0

Unnamed: 0,year,sex,name,count
0,2018,Male,Liam,19837
1,2018,Female,Emma,18688
2,2017,Male,Liam,18798
3,2017,Female,Emma,19800
4,2016,Male,Noah,19117
5,2016,Female,Emma,19496


#### Single-column deletion

If we want to delete a single column from a `DataFrame`, we can use a `del` statement.

In [3]:
# First, copy the DataFrame (we'll modify the copy)
tn2 = topnames0.copy()

# Delete the 'name' column
del tn2["name"]

# Display the modified DataFrame
tn2

Unnamed: 0,year,sex,count
0,2018,Male,19837
1,2018,Female,18688
2,2017,Male,18798
3,2017,Female,19800
4,2016,Male,19117
5,2016,Female,19496


Similarly, we could use `pop()` to delete a single column.  Like with `list`s in Python, `pop()` returns the removed element.

In [4]:
# First, copy the DataFrame (we'll modify the copy)
tn3 = topnames0.copy()

# Delete and store the 'count' column
column_series = tn3.pop("count")

# Display the modified DataFrame
tn3

Unnamed: 0,year,sex,name
0,2018,Male,Liam
1,2018,Female,Emma
2,2017,Male,Liam
3,2017,Female,Emma
4,2016,Male,Noah
5,2016,Female,Emma


As the result of using `pop()` is a single column, its type is a `Series`.

In [5]:
# Display the popped Series
column_series

0    19837
1    18688
2    18798
3    19800
4    19117
5    19496
Name: count, dtype: int64

#### Multiple-column deletion

We can use the `drop()` method of the `DataFrame` class to drop multiple columns.  The first argument to `drop()` should be a single column label (to drop one column) or a list of column labels (to drop multiple columns).

In [6]:
# First, copy the DataFrame (we'll modify the copy)
tn4 = topnames0.copy()

# Delete just the 'Name' column
tn4.drop('name', axis=1, inplace=True)

# Display the modified DataFrame
tn4

Unnamed: 0,year,sex,count
0,2018,Male,19837
1,2018,Female,18688
2,2017,Male,18798
3,2017,Female,19800
4,2016,Male,19117
5,2016,Female,19496


In the above example, we specified `inplace=True`.  This modifies the given `DataFrame`.  We could skip the copy step by using `inplace=False`, which would return the modified `DataFrame` copy.

In [7]:
# Delete just the 'Name' column
tn5 = topnames0.drop('name', axis=1, inplace=False) # return a modified copy

# Display the modified DataFrame (same as tn4)
tn5

Unnamed: 0,year,sex,count
0,2018,Male,19837
1,2018,Female,18688
2,2017,Male,18798
3,2017,Female,19800
4,2016,Male,19117
5,2016,Female,19496


#### Row deletion

In the previous two examples, we used `axis=1` to specify that we wanted to drop one or more columns.  We could instead use `axis=0` to specify that we should drop rows.

In [8]:
# Delete rows 2-3 (using the row labels)
tn6 = topnames0.drop([2,3], axis=0, inplace=False) # return a modified copy

# Display the modified DataFrame
tn6

Unnamed: 0,year,sex,name,count
0,2018,Male,Liam,19837
1,2018,Female,Emma,18688
4,2016,Male,Noah,19117
5,2016,Female,Emma,19496


If we use a multi-level index for row labels, we can specify a drop using a specific level.

In [9]:
# Copy topnames0 and give it a two-level index
tn7 = topnames0.set_index(['year', 'sex'], inplace=False)

# Delete rows for 2017 (using the two-level index)
tn7.drop([2017], level="year", axis=0, inplace=True)

# Display the modified DataFrame
tn7

Unnamed: 0_level_0,Unnamed: 1_level_0,name,count
year,sex,Unnamed: 2_level_1,Unnamed: 3_level_1
2018,Male,Liam,19837
2018,Female,Emma,18688
2016,Male,Noah,19117
2016,Female,Emma,19496


In [10]:
# Copy topnames0 and give it a two-level index
tn8 = topnames0.set_index(['year', 'sex'], inplace=False)

# Delete rows for Male (using the two-level index)
tn8.drop(["Male"], level="sex", axis=0, inplace=True)

# Display the modified DataFrame
tn8

Unnamed: 0_level_0,Unnamed: 1_level_0,name,count
year,sex,Unnamed: 2_level_1,Unnamed: 3_level_1
2018,Female,Emma,18688
2017,Female,Emma,19800
2016,Female,Emma,19496


---

## Part B: Adding a Column

We can add a new column, represented as a `Series`, to an existing `DataFrame`.  To do this, we use the same syntax to project a column, but on the _left-hand side_ of an assignment statement.

In [11]:
# First, copy the DataFrame (we'll modify the copy)
tn9 = topnames0.copy()

# Add some new columns
tn9["oddyear"] = tn9["year"] % 2 == 1                     # year is odd
tn9["namelen"] = tn9["name"].apply(len)                   # length of each name
tn9["namecaps"] = tn9["name"].apply(lambda s: s.upper())  # name in all caps

# Display the modified DataFrame
tn9

Unnamed: 0,year,sex,name,count,oddyear,namelen,namecaps
0,2018,Male,Liam,19837,False,4,LIAM
1,2018,Female,Emma,18688,False,4,EMMA
2,2017,Male,Liam,18798,True,4,LIAM
3,2017,Female,Emma,19800,True,4,EMMA
4,2016,Male,Noah,19117,False,4,NOAH
5,2016,Female,Emma,19496,False,4,EMMA


Note that if we project multiple columns, this projection is tied to the original data.  This means that if we modify the projection, we modify the original as well.

The same thing occurs with Python `list`s.

In [12]:
# Start with a large list
mylist = [1, 3, 6, 7, 19, 22]

# Create an alias for my list
alias = mylist

# Modify the alias
alias[-1] = -10
alias.append(1000)

# Check the state of both
print(mylist)
print(alias)

[1, 3, 6, 7, 19, -10, 1000]
[1, 3, 6, 7, 19, -10, 1000]


With `DataFrame`s, this can cause issues if we try to add a column to a projection.

In [13]:
# First, copy the DataFrame (we'll modify the copy)
tn10 = topnames0.copy()

# Project the "name" and "count" columns
tn10_proj = tn10[["name", "count"]]

# Attempt to add a column to the projection
tn10_proj["namelen"] = tn10["name"].apply(len) # displays warning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


However, despite this warning, it will add the column to the projection.

In [14]:
# Display the original DataFrame
tn10

Unnamed: 0,year,sex,name,count
0,2018,Male,Liam,19837
1,2018,Female,Emma,18688
2,2017,Male,Liam,18798
3,2017,Female,Emma,19800
4,2016,Male,Noah,19117
5,2016,Female,Emma,19496


In [15]:
# Display the modified projection
tn10_proj

Unnamed: 0,name,count,namelen
0,Liam,19837,4
1,Emma,18688,4
2,Liam,18798,4
3,Emma,19800,4
4,Noah,19117,4
5,Emma,19496,4


Because of the assumed correspondence of a projection and the original data, we can use `copy()` to make clear that the assignment is a one-time operator.

---

## Part C: Updating columns

We can use the same syntax for adding a column to update all values in a given column.

In [16]:
# First, copy the DataFrame (we'll modify the copy)
tn11 = topnames0.copy()

# Modify the 'count' column (e.g., to change units)
tn11["count"] = tn11["count"] / 1000

# Display the modified DataFrame
tn11

Unnamed: 0,year,sex,name,count
0,2018,Male,Liam,19.837
1,2018,Female,Emma,18.688
2,2017,Male,Liam,18.798
3,2017,Female,Emma,19.8
4,2016,Male,Noah,19.117
5,2016,Female,Emma,19.496


We can use `.loc` and a row filter to update only some values in a given column.

In [17]:
# First, copy the DataFrame (we'll modify the copy)
tn12 = topnames0.copy()

# Modify the 'count' column for rows with Female 'sex' (e.g., to change units)
tn12.loc[tn12.sex == "Female", "count"] = tn12["count"] / 1000

# Display the modified DataFrame
tn12

Unnamed: 0,year,sex,name,count
0,2018,Male,Liam,19837.0
1,2018,Female,Emma,18.688
2,2017,Male,Liam,18798.0
3,2017,Female,Emma,19.8
4,2016,Male,Noah,19117.0
5,2016,Female,Emma,19.496


We have the same issue trying to modify the values in a projection as we did trying to add a column.

In [18]:
# First, copy the DataFrame (we'll modify the copy)
tn13 = topnames0.copy()

# Project the "name" and "count" columns
tn13_proj = tn13[["name", "count"]]

# Modify the projection
tn13_proj.loc[:,"count"] = tn13_proj["count"] / 1000 # displays a warning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


---

## Part D: Try it Yourself

**Q1:** Read CSV file `indicators2016.csv` in `datadir` into a data frame named `indicators0`, with no index.

In [19]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display a subset of the DataFrame
indicators0 = pd.read_csv(os.path.join(datadir, "indicators2016.csv"))
indicators0.head()

Unnamed: 0,code,country,pop,gdp,life,cell
0,ABW,Aruba,0.1,,75.87,
1,AFG,Afghanistan,34.66,19.47,63.67,21.6
2,AGO,Angola,28.81,95.34,61.55,13.0
3,ALB,Albania,2.88,11.86,78.34,3.37
4,AND,Andorra,0.08,2.86,,0.07


**Q2:** Use `.pop()` to remove the `'code'` column from the dataset (modifying the original dataset), and store the resulting `Series` in `code_series`.

In [20]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the Series
code_series = indicators0.pop("code")

In [21]:
# Display the modified DataFrame (should not have a 'code' column)
indicators0.head()

Unnamed: 0,country,pop,gdp,life,cell
0,Aruba,0.1,,75.87,
1,Afghanistan,34.66,19.47,63.67,21.6
2,Angola,28.81,95.34,61.55,13.0
3,Albania,2.88,11.86,78.34,3.37
4,Andorra,0.08,2.86,,0.07


**Q3:** Make a copy of `indicators0`, called `indicators`, and assign `code_series` to be its index.

In [30]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the modified DataFrame copy
indicators = indicators0.copy()
# indicators["code"] = code_series
indicators.set_index(code_series)
# indicators.set_index(code_series, inplace=True)

Unnamed: 0,country,pop,gdp,life,cell
0,Aruba,0.10,,75.87,
1,Afghanistan,34.66,19.47,63.67,21.60
2,Angola,28.81,95.34,61.55,13.00
3,Albania,2.88,11.86,78.34,3.37
4,Andorra,0.08,2.86,,0.07
...,...,...,...,...,...
215,Kosovo,1.82,6.65,71.65,
216,"Yemen, Rep.",27.58,27.32,64.95,16.43
217,South Africa,56.02,295.46,62.77,82.41
218,Zambia,16.59,21.06,61.87,12.02


> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: After popping the `'code'` column from `indicators0`, how could you instead create a new column in the data frame with the same value as `code_series`, effectively putting that column back into the data frame?

---

## Part E: A New Dataset

The file `members.csv` in `datadir` has (fake) information on a number of individuals in Ohio.

**Q4:** Read this dataset into a `pandas DataFrame` using `read_csv`.  Name it `members0` and do not include an index.

In [23]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame
members0 = pd.read_csv(os.path.join(datadir, "members.csv"))

**Q5:** Repeat the above, but now do include an index by specifying `index_col` in the constructor.  Name this `DataFrame` `members`.  (Hint: take a look at the file to determine a reasonable index.)

In [24]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame
members = pd.read_csv(os.path.join(datadir, "members.csv"), index_col="ID#")

**Q6:** We will now split the column 'Name' into two different columns for first and last name.  As a first step:

  1. Write a lambda function that will, given a string, split on spaces and select only the first element of the resultant list.
  2. Apply the lambda function to the `'Name'` column, and store the result as `fname_series`.

In [25]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the Series
fname_series = members["Name"].apply(lambda name: name.split(" ")[0])

**Q7:** Assign `fname_series` as a new column in `members` with column name `FName`.

In [26]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame (should have a new column, FName)
members["FName"] = fname_series

**Q8:** Similar to Q6-Q7, create a new column, called `LName`, in `members` using the last name.

In [27]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame (should have a new column, LName)
members["LName"] = members["Name"].apply(lambda name: name.split(" ")[1])

**Q9:** Similar to Q6-Q8, create two new columns, called `City` and `State`, based on the `Address` column.  Make sure that neither the city nor state has any leading or trailing spaces.

In [28]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame (should have two new columns, City and State)
members["City"] = members["Address"].apply(lambda addr: addr.strip().split(", ")[0])
members["State"] = members["Address"].apply(lambda addr: addr.strip().split(", ")[1])

**Q10:** Drop the original `'Name'` and `'Address'` columns.

In [29]:
# YOUR CODE HERE
# raise NotImplementedError()

# Display the DataFrame (should have removed Name and Address columns)
members.drop(["Name", "Address"], axis=1, inplace=True)
members

"""
members.loc[members['LName'] == "Marshall" & members['FName'] == "Kirk", "LName"] = "Crossley"
members.loc[732, "LName"] = "Crossley"
members.iloc[0, 4] = "Crossley"
"""

'\nmembers.loc[members[\'LName\'] == "Marshall", "LName"] = "Crossley"\nmembers.iloc[0, 4] = "Crossley"\n'

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: One of the users, Kirk Marshall, wants to change his last name to Crossley.  How would you use `loc` do do this?  What about `iloc`?

---

---
## Part F

How much time (in minutes/hours) did you spend on this lab outside of class?

I completed in class