# CPSC 4300/6300-001 Applied Data Science (Fall 2020)

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Dane Acena"
COLLABORATORS = ""

# CPSC4300/6300-001 Problem Set #1

In this problem set, you will create an NCAA college football team dataset by scraping data from the Internet and perform some basic data manipulations. The purpose of this problem set is to get familiar with data collection in Python and to gain hands-on skills in using Python for routine tasks.

The assignment has four parts: Parts A and B are auto-graded; Parts C and D are manually graded by the TA.

Before submission, make sure you run all the cells. Because the TA cannot run your notebook, your grade will be based only on the output and the written part of your submitted notebook.

## Part A: Find the List of NCAA College Football Teams (30 points)

To start, we create a list of NCAA football teams from ESPN's website. To save you time, we provide a Python module that scrapes the list of teams from a URL.

In [2]:
%%writefile espn.py
import requests
from bs4 import BeautifulSoup
from requests.exceptions import HTTPError

NCAAF_URL="https://www.espn.com/college-football/teams"

def fetch_ncaaf_teams(url=NCAAF_URL):
    """Fetch the list of NCAA football teams.
    
    Args:
        url (string): the URL of the web page for NCAA football teams
    
    Returns:
        list of dict: one dict per team with the keys "id" (int),
        "name", "conference", and "url"
    """
    teams = []
    
    response = requests.get(url)
    response.raise_for_status()  # raise HTTPError on a 4xx/5xx response
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    
    conferences = soup.find_all("div", class_="headline")
    for conference in conferences:
        conference_name = conference.get_text()
        team_links = conference.parent.find_all("a", class_="AnchorLink")
        for team_link in team_links:
            if team_link.find("h2"):
                team_url = team_link['href']
                team_name = team_link.get_text()
                team_id = int(team_url.split("/")[-2])
                teams.append({"id": team_id, "name": team_name, "conference": conference_name, "url": team_url})
    return teams

Writing espn.py
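
The scraper derives each team id from the second-to-last segment of the team link. A minimal sketch of that step is shown below; the value of `example_url` is a hypothetical link written in the shape the scraper appears to expect, not a value taken from the scraped page.

```python
# Hypothetical team link; the scraper assumes the numeric id is the
# second-to-last path segment and the team slug is the last one.
example_url = "/college-football/team/_/id/228/clemson-tigers"

parts = example_url.split("/")
team_id = int(parts[-2])  # "228" -> 228
slug = parts[-1]          # "clemson-tigers"

print(team_id, slug)
```

If a link does not follow this shape, `int()` raises a `ValueError`, so the parsing is only as reliable as the page layout.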


In [3]:
!ls

espn.py      part_a.ipynb part_b.ipynb part_c.ipynb part_d.ipynb


### Question 1.  Import and use python functions

Write a Python statement to import the `fetch_ncaaf_teams()` function and use it to fetch the list of college football teams and save the list into a variable named `team_list`. (2 points)

In [4]:
from espn import fetch_ncaaf_teams
team_list = fetch_ncaaf_teams()
# raise NotImplementedError()

In [5]:
# test
assert type(team_list) is list
print("There are {} ncaa football teams".format(len(team_list)))

There are 130 ncaa football teams


In [6]:
assert 'fetch_ncaaf_teams' in dir()
assert type(team_list) is list

In [7]:
assert "Clemson Tigers" in [team["name"] for team in team_list]

### Question 2. Write files in JSON format

JSON (JavaScript Object Notation) is a common data file format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types. For more, see https://en.wikipedia.org/wiki/JSON. Complete a code segment that writes the team list into a JSON file named `"ncaaf_teams.json"`. (2 points)

In [8]:
import json
with open('ncaaf_teams.json', 'w') as f:
    json.dump(team_list, f)
# raise NotImplementedError()

In [9]:
import os
assert os.path.exists("ncaaf_teams.json")
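
As a quick sanity check on the write step, the sketch below round-trips a small list of dicts through a JSON file. The two sample records and the file name `sample_teams.json` are illustrative assumptions; a temporary directory is used so that the notebook's own files are not touched.

```python
import json
import os
import tempfile

# Two illustrative records in the same shape as the entries of team_list.
sample_teams = [
    {"id": 228, "name": "Clemson Tigers", "conference": "ACC"},
    {"id": 259, "name": "Virginia Tech Hokies", "conference": "ACC"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_teams.json")
    with open(path, "w") as f:
        json.dump(sample_teams, f)   # serialize the list to the file
    with open(path) as f:
        loaded = json.load(f)        # parse it back into Python objects

print(loaded == sample_teams)
```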

### Question 3. Read JSON file into a DataFrame

Write a code segment that reads the teams from a JSON file into a DataFrame object named `teams`. (2 points)

In [10]:
import numpy as np
import pandas as pd
teams = pd.read_json("ncaaf_teams.json")
# raise NotImplementedError()

In [11]:
assert type(teams) is pd.core.frame.DataFrame

In [12]:
assert pd.unique(teams["conference"]).size == 11
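
To illustrate how `pd.read_json` treats a list of records, the sketch below parses an in-memory JSON string instead of a file. The two records are illustrative, and wrapping the text in `io.StringIO` is an assumption made so that the example needs no file on disk.

```python
import io
import json

import pandas as pd

records = [
    {"id": 228, "name": "Clemson Tigers", "conference": "ACC"},
    {"id": 259, "name": "Virginia Tech Hokies", "conference": "ACC"},
]
json_text = json.dumps(records)

# Each dict becomes one row; the dict keys become the column names.
frame = pd.read_json(io.StringIO(json_text))

print(frame.shape)
print(frame.columns.tolist())
```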

### Question 4. Examine a DataFrame Object

Write a code segment that saves the first 20 rows of the DataFrame `teams` to a variable `row1_20`. (2 points)

In [13]:
row1_20 = teams.loc[:19]
# raise NotImplementedError()
print(row1_20.shape)

(20, 4)


In [14]:
assert row1_20.shape == (20, 4)
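
Label-based slicing with `.loc` includes the end label, while position-based slicing with `.iloc` excludes the end position. The sketch below, on a hypothetical 30-row frame named `demo` with the default integer index, shows three ways that should give the same first 20 rows.

```python
import pandas as pd

demo = pd.DataFrame({"value": range(30)})  # default index 0..29

by_label = demo.loc[:19]      # label slice: end label 19 is included
by_position = demo.iloc[:20]  # position slice: end position 20 is excluded
by_head = demo.head(20)       # first 20 rows

print(by_label.shape, by_position.shape, by_head.shape)
```

Note that `.loc[:19]` only matches the other two while the index is the default `0..n-1` range; after the index is changed, `.iloc` or `.head` is the safer choice.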

### Question 5. Find the columns of the DataFrame Object

Write a code segment that saves the names of the columns of the `teams` object. (2 points)

In [15]:
columns = teams.columns
# raise NotImplementedError()

In [16]:
assert type(columns) == pd.core.indexes.base.Index and columns.shape == (4,)

### Question 6. Find the size of the DataFrame Object

Write a single statement that finds the number of rows and columns of the `teams` DataFrame object and saves the number of rows to the variable `nrows` and the number of columns to the variable `ncols`. (2 points)

In [17]:
nrows, ncols = teams.shape  # shape is (number of rows, number of columns)
# raise NotImplementedError()
nrows, ncols

(130, 4)
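
Note that `DataFrame.shape` is a `(rows, columns)` tuple, whereas `DataFrame.size` is the total number of cells (rows times columns). A small sketch on a hypothetical 3-by-2 frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = demo.shape  # tuple unpacking: (3, 2)
n_cells = demo.size          # 3 rows * 2 columns

print(n_rows, n_cols, n_cells)
```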

### Question 7. Drop a Column in DataFrame

Write a statement to drop the url column in the teams DataFrame object without creating a new copy of the object. (2 points)

In [18]:
teams.drop(columns=['url'], inplace=True)  # modifies teams in place instead of binding a new copy
# raise NotImplementedError()

In [19]:
assert teams.columns.size == 3 and "url" not in teams.columns
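
The difference between the copying and the in-place form of `drop` can be sketched on a hypothetical one-row frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"name": ["Clemson Tigers"], "url": ["placeholder"]})

# Default behavior: returns a new DataFrame and leaves demo unchanged.
dropped_copy = demo.drop(columns=["url"])
still_has_url = "url" in demo.columns

# inplace=True: modifies demo itself and returns None.
result = demo.drop(columns=["url"], inplace=True)

print(still_has_url, result, demo.columns.tolist())
```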

### Question 8. Set the team id as the Index of the teams object

In [20]:
teams.set_index('id', inplace=True)
# raise NotImplementedError()

In [21]:
teams.head()

| id | name | conference |
|---|---|---|
| 2132 | Cincinnati Bearcats | American |
| 151 | East Carolina Pirates | American |
| 248 | Houston Cougars | American |
| 235 | Memphis Tigers | American |
| 2426 | Navy Midshipmen | American |


In [22]:
assert(teams.index.name == 'id')

### Question 9. Select Rows by Column Value or Index

Select the team whose team id is 259 and save it to `team_259`.

In [23]:
team_259 = teams.loc[[259]]
# raise NotImplementedError()
team_259

| id | name | conference |
|---|---|---|
| 259 | Virginia Tech Hokies | ACC |


In [24]:
assert team_259.iloc[0]['name'] == 'Virginia Tech Hokies'
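
Passing a list of labels to `.loc` returns a DataFrame, while passing a single label returns a Series. The sketch below shows both on a hypothetical two-row frame named `demo`, indexed by `id` like `teams`:

```python
import pandas as pd

demo = pd.DataFrame(
    {"name": ["Clemson Tigers", "Virginia Tech Hokies"],
     "conference": ["ACC", "ACC"]},
    index=pd.Index([228, 259], name="id"),
)

as_frame = demo.loc[[259]]  # list of labels -> one-row DataFrame
as_series = demo.loc[259]   # single label   -> Series for that row

print(type(as_frame).__name__, type(as_series).__name__)
```

The one-row DataFrame form is what makes `team_259.iloc[0]['name']` in the test work.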


### Question 10. Select Rows Based on Specific Conditions

Find the rows in the `teams` DataFrame object in which the team is in the ACC conference and save the results into a variable `acc_teams`. (2 points)

In [25]:
acc_teams = teams[teams['conference'].str.contains('ACC')]
# raise NotImplementedError()
acc_teams

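| id | name | conference |
|---|---|---|
| 103 | Boston College Eagles | ACC |
| 228 | Clemson Tigers | ACC |
| 150 | Duke Blue Devils | ACC |
| 52 | Florida State Seminoles | ACC |
| 59 | Georgia Tech Yellow Jackets | ACC |
| 97 | Louisville Cardinals | ACC |
| 2390 | Miami Hurricanes | ACC |
| 152 | NC State Wolfpack | ACC |
| 153 | North Carolina Tar Heels | ACC |
| 87 | Notre Dame Fighting Irish | ACC |
| 221 | Pittsburgh Panthers | ACC |
| 183 | Syracuse Orange | ACC |
| 258 | Virginia Cavaliers | ACC |
| 259 | Virginia Tech Hokies | ACC |
| 154 | Wake Forest Demon Deacons | ACC |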


In [26]:
# test
assert pd.unique(acc_teams['conference'])[0] == 'ACC'

In [27]:
assert pd.unique(acc_teams['conference'])[0] == 'ACC'

In [28]:
assert acc_teams['conference'].shape[0] == 15
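
A boolean mask built with `==` matches the conference name exactly, while `str.contains` matches any value that contains the substring. The sketch below uses a hypothetical frame named `demo`; the team names and the conference label `"ACC Extra"` are made up to show where the two approaches differ.

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Team A", "Team B", "Team C"],
    "conference": ["ACC", "SEC", "ACC Extra"],  # made-up labels
})

exact = demo[demo["conference"] == "ACC"]                # exact match only
substring = demo[demo["conference"].str.contains("ACC")] # any value containing "ACC"

print(len(exact), len(substring))
```

For the scraped data the two filters give the same result, because no other conference name contains the text `ACC`.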

### Question 11. Write a DataFrame Object to a JSON File

Write the `acc_teams` object to a JSON file named `acc_teams.json`. (2 points)

In [29]:
json_acc_teams = acc_teams.to_json(orient='index')
print(type(acc_teams))
with open('acc_teams.json', 'w') as f:
    f.write(json_acc_teams)
# raise NotImplementedError()

<class 'pandas.core.frame.DataFrame'>


In [30]:
!cat 'acc_teams.json'

{"103":{"name":"Boston College Eagles","conference":"ACC"},"228":{"name":"Clemson Tigers","conference":"ACC"},"150":{"name":"Duke Blue Devils","conference":"ACC"},"52":{"name":"Florida State Seminoles","conference":"ACC"},"59":{"name":"Georgia Tech Yellow Jackets","conference":"ACC"},"97":{"name":"Louisville Cardinals","conference":"ACC"},"2390":{"name":"Miami Hurricanes","conference":"ACC"},"152":{"name":"NC State Wolfpack","conference":"ACC"},"153":{"name":"North Carolina Tar Heels","conference":"ACC"},"87":{"name":"Notre Dame Fighting Irish","conference":"ACC"},"221":{"name":"Pittsburgh Panthers","conference":"ACC"},"183":{"name":"Syracuse Orange","conference":"ACC"},"258":{"name":"Virginia Cavaliers","conference":"ACC"},"259":{"name":"Virginia Tech Hokies","conference":"ACC"},"154":{"name":"Wake Forest Demon Deacons","conference":"ACC"}}

In [31]:
assert os.path.exists('acc_teams.json')

### Question 12. The Data Organization of JSON Format

Compare the file `ncaaf_teams.json` with the file `acc_teams.json`. You may notice that even though both files store similar types of information, the data is organized differently in the two files.

Which file organizes the data as a list of dicts?

In [32]:
filename = "ncaaf_teams.json"
# raise NotImplementedError()
filename

'ncaaf_teams.json'

Suppose both files are read into memory and the number of teams is large. Which data organization is more efficient for looking up a team by its team id?

In [33]:
filename = "acc_teams.json"
# raise NotImplementedError()
filename

'acc_teams.json'
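
The reasoning behind this answer can be sketched with the two layouts side by side. The helper `find_in_list` and the two sample records are illustrative assumptions: a list of dicts has to be scanned record by record, while a dict keyed by team id supports a direct hash lookup.

```python
# Layout of ncaaf_teams.json: a list of dicts.
teams_as_list = [
    {"id": 228, "name": "Clemson Tigers"},
    {"id": 259, "name": "Virginia Tech Hokies"},
]

# Layout of acc_teams.json: a dict keyed by team id (JSON object keys are strings).
teams_by_id = {
    "228": {"name": "Clemson Tigers"},
    "259": {"name": "Virginia Tech Hokies"},
}


def find_in_list(teams, team_id):
    """Linear scan: O(n) comparisons in the number of teams."""
    for team in teams:
        if team["id"] == team_id:
            return team
    return None


from_list = find_in_list(teams_as_list, 259)
from_dict = teams_by_id.get("259")  # hash lookup: O(1) on average

print(from_list["name"], from_dict["name"])
```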

### Question 13. CSV file format.

Write the `acc_teams` data frame into a csv file `acc_teams.csv`. (2 points)

In [34]:
acc_teams.to_csv('acc_teams.csv', index=False)
# raise NotImplementedError()

In [35]:
import os
assert os.path.exists('acc_teams.csv')

In [36]:
!cat 'acc_teams.csv'

name,conference
Boston College Eagles,ACC
Clemson Tigers,ACC
Duke Blue Devils,ACC
Florida State Seminoles,ACC
Georgia Tech Yellow Jackets,ACC
Louisville Cardinals,ACC
Miami Hurricanes,ACC
NC State Wolfpack,ACC
North Carolina Tar Heels,ACC
Notre Dame Fighting Irish,ACC
Pittsburgh Panthers,ACC
Syracuse Orange,ACC
Virginia Cavaliers,ACC
Virginia Tech Hokies,ACC
Wake Forest Demon Deacons,ACC
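
Because `id` is the index of `acc_teams`, writing with `index=False` leaves the team ids out of the file. The sketch below, on a hypothetical one-row frame named `demo`, compares the two options; calling `to_csv` without a path returns the CSV text as a string.

```python
import pandas as pd

demo = pd.DataFrame(
    {"name": ["Clemson Tigers"], "conference": ["ACC"]},
    index=pd.Index([228], name="id"),
)

without_index = demo.to_csv(index=False)  # index column is omitted
with_index = demo.to_csv()                # index is written as the first column

print(without_index.splitlines())
print(with_index.splitlines())
```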


### Question 14. Sort Arrays

Sort the ACC teams by team name in alphabetical order and save the results in a list `sorted_acc_team_names`. (2 points)

In [37]:
sorted_acc_team_names = sorted(acc_teams['name'])  # sort explicitly rather than relying on the scraped order
# raise NotImplementedError()

In [38]:
### BEGIN TESTS
assert ''.join([team[0] for team in sorted_acc_team_names]) == 'BCDFGLMNNNPSVVW'
### END TESTS
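
The built-in `sorted` returns a new list and leaves its input unchanged, which makes it a safe way to order names regardless of how they were scraped. A minimal sketch on three team names listed out of order:

```python
names = ["Virginia Cavaliers", "Clemson Tigers", "Duke Blue Devils"]

sorted_names = sorted(names)  # new list in ascending (alphabetical) order

print(sorted_names)
print(names[0])  # the original list keeps its order
```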

### Congratulations! You have finished Part A of the problem.