# Lecture 7: Obtaining Data from a File
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outline
* File Requirements
  * Description of Variables (in file or as a separate document)
* Loading Data from a File
  * Column Names
  * Headers
  * Showing just the head or tail of the data
  * Splitting Variables

In [None]:
import pandas as pd

-----
## Loading Data from File

Data must be stored in a supported file format before it can be loaded into Python. Python can read several formats on its own, and Pandas adds loading functions for many more.
* [File types handled by Pandas - McKinney, Chapter 6](https://wesmckinney.com/book/accessing-data#tbl-table_parsing_functions)

Regardless of the file type, we must know what we are dealing with. Consider this dataset:

In [None]:
import pandas as pd
grades = pd.read_csv("../Datasets/grades.csv")
grades

Uh-oh! Something is wrong. We can see the data, and from the name of the file, it looks like we have some sort of list of grades. We have the names of the students, but we have no idea what each variable in our DataFrame is.

Let's look now at a few requirements for good datasets.

-----
## Good Rules for Handling Data

First thing, take a look at the data. It is always good to see what the data looks like before we try to do anything with it.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades.csv)

1. Look at your data before loading it
2. When you are saving data, make sure there is documentation
    * Documentation includes an explanation of what the dataset is, what the columns of the dataset are, and units for each variable.
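
One way to follow the first rule without leaving Python is to print the first few raw lines of a file before parsing it. The sketch below is only an illustration: it writes a tiny stand-in file (the name `example_grades.csv` and its rows are invented, not the course dataset) and prints its first lines exactly as they are stored.

```python
from itertools import islice
from pathlib import Path
import tempfile

# Hypothetical stand-in file; the rows are invented for illustration only
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "example_grades.csv"
path.write_text("Student A,10,95\nStudent B,9,88\nStudent C,10,72\n")

# Read only the first two raw lines, without parsing them
with open(path) as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 2)]

for line in first_lines:
    print(line)
```

Seeing the raw lines makes it easy to spot whether the file has a header row, which delimiter it uses, and whether any description lines come before the data.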

This file has data, but the columns have no labels. When we imported it earlier, Pandas assumed the first line was a header and used its values as the column labels. To avoid that, we tell Pandas that there is no header by passing `header=None`.

In [None]:
grades = pd.read_csv('../Datasets/grades.csv', header=None)
grades.head(10)

However, this is as far as we can go. Without an explanation of the variables, our dataset is basically useless, because we can't decide what to do with it.

Every dataset MUST have some explanation of the variables in our dataset. Very often, this is done with a separate README file. For this dataset, look at the documentation here:
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades_README.txt](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades_README.txt)

Using this, we can update our DataFrame.

In [None]:
grades = pd.read_csv('../Datasets/grades.csv', header=None)
grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']

# See the first few rows of the data (default=5)
grades.head()   
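
The column labels can also be supplied while the file is being read, through the `names=` argument of `pd.read_csv()`. The sketch below uses a small in-memory CSV with invented rows and three shortened column names (assumptions for illustration, not the real `grades.csv`), so it runs without the course files. It also shows `.tail()`, the counterpart of `.head()`.

```python
from io import StringIO
import pandas as pd

# Invented stand-in for a headerless CSV file
raw = StringIO("Student A,10,95\nStudent B,9,88\nStudent C,10,72\n")

# header=None keeps the first line as data; names= supplies the labels
example = pd.read_csv(raw, header=None, names=["Name", "Attendance", "Homework"])

print(example.head(2))  # first two rows
print(example.tail(1))  # last row
```

Passing `names=` at load time gives the same result as assigning `.columns` afterward; which one to use is mostly a matter of taste.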

Sometimes the variables are labeled in the data file itself, as in the following example, though the labels may be written as short codes. Again, the README file should explain what each one means.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades1.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades1.csv)

In [None]:
grades = pd.read_csv('../Datasets/grades1.csv') # No header argument: the first row (header=0) supplies the column names
display(grades.head(4))

grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']
display(grades.head(4))

Sometimes, the file is separated by a character other than a comma. Some common separators (a.k.a. delimiters) include:
* `;` a semi-colon
* `:` a colon
* ` ` a space
* `\t` a tab

Here's the same dataset, only this time separated by a semicolon (;).
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades2.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades2.csv)

In [None]:
grades = pd.read_csv('../Datasets/grades2.csv', sep=';')  # sep tells Pandas the delimiter
grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']

display(grades.head(4))
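
To see what the `sep=` argument changes, the sketch below reads the same invented rows (not the course data) stored with two different delimiters, and then reads the semicolon version without `sep=` for comparison. Everything is held in memory with `StringIO`, so no files are needed.

```python
from io import StringIO
import pandas as pd

# The same invented rows, stored with two different delimiters
semicolon_text = "Name;Attendance;Homework\nStudent A;10;95\nStudent B;9;88\n"
tab_text = "Name\tAttendance\tHomework\nStudent A\t10\t95\nStudent B\t9\t88\n"

# sep= tells Pandas which character separates the fields
from_semicolon = pd.read_csv(StringIO(semicolon_text), sep=";")
from_tab = pd.read_csv(StringIO(tab_text), sep="\t")

# Without sep=, Pandas looks for commas and reads each line as one long field
wrong = pd.read_csv(StringIO(semicolon_text))

print(from_semicolon)
print(wrong.shape)
```

A DataFrame with a single column whose values still contain the separator character is the usual sign that the wrong delimiter was used.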

Finally, there is one other way documentation might be provided. We might see the labels *in the file itself*. If we are following good practice, we will open the file before importing the data and see these labels.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades3.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades3.csv)

In this case, there are extra lines ahead of the data which explain the data itself. But this makes loading the data more difficult. For this, use the `skiprows=` argument.

In [None]:
grades = pd.read_csv('../Datasets/grades3.csv', skiprows=9)
display(grades.head(4))

grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']
display(grades.head(4))
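
The value passed to `skiprows=` is simply the number of lines that come before the header row, which is why it pays to open the file and count them first. The sketch below shows this on an invented in-memory file with three description lines (the contents are assumptions for illustration, not `grades3.csv`).

```python
from io import StringIO
import pandas as pd

# Invented file contents: three description lines, then a header row and data
text = (
    "Example grade sheet\n"
    "Attendance: days attended\n"
    "Homework: percent\n"
    "Name,Attendance,Homework\n"
    "Student A,10,95\n"
    "Student B,9,88\n"
)

# Skip the three description lines so the fourth line becomes the header
example = pd.read_csv(StringIO(text), skiprows=3)
print(example)
```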

-----
## Loading data from an Excel file

We could simply read an Excel file just like any other file type, using `pd.read_excel()`.

In [None]:
grades = pd.read_excel('../grades3.xlsx', skiprows=10)
grades

However, Pandas offers some more functionality. If we use `pd.ExcelFile()` to load the file, then we get an object with all worksheets within the file. We can then choose which one we want to work with.

In [None]:
excel = pd.ExcelFile('grades3.xlsx')
excel

In [None]:
excel.sheet_names

In [None]:
grades = excel.parse(sheet_name='grades3', skiprows=10)
grades

-----
## Loading data from a file on the Internet

Loading a file from the internet is just like reading it from your computer. Just pass a URL instead of a file path.

In [None]:
salaries = pd.read_csv('https://raw.githubusercontent.com/drolsonmi/math3080/main/Datasets/data_science_salaries.csv')
salaries.head(4)

-----
## Web Scraping


Web scraping is a very important tool and technique. A lot of data on the internet is presented as a table on an HTML page.

If you are like me, you have handled HTML tables by going to the webpage, copying the table, pasting it into an Excel file, and then working for an hour or two to get the data into a format you can use. This is obnoxious and takes way too much time.

The idea of __web scraping__ is to go through the source of the page itself and identify any tables in it. Most commonly, we apply web scraping to *HTML* and *XML* files.

### Web Scraping with HTML

* Uses `pd.read_html()`
* Requires dependency `lxml`

In Pandas, we have a `pd.read_html()` function. It takes the given HTML file and looks for `<table>` tags. Each table it finds is decoded and stored as a DataFrame. To use it, install the `lxml` package.

What if there are multiple tables on the webpage? The `pd.read_html()` command finds all `<table>` tags and converts each one into its own DataFrame, collecting them in a list. So, the output of `pd.read_html()` is a list of DataFrames, even when the page has only one table. To access the table you want, index into the list.
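
The list behavior can be checked on a small scale. The sketch below builds an invented HTML page with two tiny tables (the contents are made up for illustration) and passes it to `pd.read_html()` through `StringIO`. It assumes a parser such as `lxml` is installed.

```python
from io import StringIO
import pandas as pd

# Invented HTML page containing two small tables
html = """
<html><body>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Student A</td><td>95</td></tr>
</table>
<table>
  <tr><th>Course</th><th>Credits</th></tr>
  <tr><td>Example 101</td><td>3</td></tr>
</table>
</body></html>
"""

# read_html returns a Python list with one DataFrame per <table> tag
found = pd.read_html(StringIO(html))

print(type(found), len(found))
print(found[1])  # the second table
```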

Look at the [author's FDIC example](https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/fdic_failed_bank_list.html)

In [None]:
banks = pd.read_html('https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/fdic_failed_bank_list.html')
display(banks[0])

Let's do another example, looking at details of the [Lord of the Rings Movie Trilogy](https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)) on Wikipedia.

In [27]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)"

lotr = pd.read_html(url)
display(lotr)

HTTPError: HTTP Error 403: Forbidden

The request is rejected with a 403 error because it does not identify itself the way a browser would. To get around this, fetch the page with the `requests` package, send a browser-like `User-Agent` header, and then hand the downloaded HTML to `pd.read_html()`.

In [1]:
import requests
import pandas as pd
from io import StringIO

url = "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)"

# Set headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

# Fetch the page using requests with headers
response = requests.get(url, headers=headers)

# Parse the HTML content with pandas
# (wrapped in StringIO, since passing raw HTML text directly is deprecated)
tables = pd.read_html(StringIO(response.text))

# Display the first table (the infobox); change the index to view the others
display(tables[0])


Unnamed: 0,The Lord of the Rings,The Lord of the Rings.1
0,,
1,Directed by,Peter Jackson
2,Screenplay by,Fran Walsh Philippa Boyens Peter Jackson Steph...
3,Based on,The Lord of the Rings by J. R. R. Tolkien
4,Produced by,Barrie M. Osborne Peter Jackson Fran Walsh Tim...
5,Starring,Elijah Wood Ian McKellen Viggo Mortensen Sean ...
6,Cinematography,Andrew Lesnie
7,Edited by,John GilbertFOTR Michael HortonTT Jamie Selkir...
8,Music by,Howard Shore
9,Production companies,New Line Cinema WingNut Films


In [2]:
display(tables[4])

Unnamed: 0_level_0,Film,U.S. release date,Box office gross,Box office gross,Box office gross,All-time ranking,All-time ranking,All-time ranking,All-time ranking,Budget,Ref(s)
Unnamed: 0_level_1,Film,U.S. release date,U.S. and Canada,Other territories,Worldwide,U.S. and Canada,U.S. and Canada,Worldwide,Worldwide,Budget,Ref(s)
Unnamed: 0_level_2,Film,U.S. release date,U.S. and Canada,Other territories,Worldwide,Rank,Peak,Rank,Peak,Budget,Ref(s)
0,The Fellowship of the Ring,19 December 2001,"$319,372,078","$568,468,333","$887,840,411",91.0,9.0,77.0,5.0,$93 million,[64][65][66]
1,The Two Towers,18 December 2002,"$345,518,923","$592,392,292","$937,911,215",72.0,7.0,71.0,4.0,$94 million,[67][68][69]
2,The Return of the King,17 December 2003,"$381,878,219","$756,308,664","$1,138,186,883",51.0,6.0,31.0,2.0,$94 million,[70][71][72]
3,Total,Total,"$1,046,769,220","$1,917,169,289","$2,963,938,509",,,,,$281 million,[note 1]


In [3]:
box_office = tables[4]
print(box_office)

                         Film U.S. release date Box office gross  \
                         Film U.S. release date  U.S. and Canada   
                         Film U.S. release date  U.S. and Canada   
0  The Fellowship of the Ring  19 December 2001     $319,372,078   
1              The Two Towers  18 December 2002     $345,518,923   
2      The Return of the King  17 December 2003     $381,878,219   
3                       Total             Total   $1,046,769,220   

                                    All-time ranking                      \
  Other territories       Worldwide  U.S. and Canada      Worldwide        
  Other territories       Worldwide             Rank Peak      Rank Peak   
0      $568,468,333    $887,840,411             91.0  9.0      77.0  5.0   
1      $592,392,292    $937,911,215             72.0  7.0      71.0  4.0   
2      $756,308,664  $1,138,186,883             51.0  6.0      31.0  2.0   
3    $1,917,169,289  $2,963,938,509              NaN  NaN       NaN

In [4]:
for i, col in enumerate(box_office.columns):
    print(i, col)

0 ('Film', 'Film', 'Film')
1 ('U.S. release date', 'U.S. release date', 'U.S. release date')
2 ('Box office gross', 'U.S. and Canada', 'U.S. and Canada')
3 ('Box office gross', 'Other territories', 'Other territories')
4 ('Box office gross', 'Worldwide', 'Worldwide')
5 ('All-time ranking', 'U.S. and Canada', 'Rank')
6 ('All-time ranking', 'U.S. and Canada', 'Peak')
7 ('All-time ranking', 'Worldwide', 'Rank')
8 ('All-time ranking', 'Worldwide', 'Peak')
9 ('Budget', 'Budget', 'Budget')
10 ('Ref(s)', 'Ref(s)', 'Ref(s)')


In [5]:
for i, col in enumerate(box_office.columns):
    print(i, col[2])

0 Film
1 U.S. release date
2 U.S. and Canada
3 Other territories
4 Worldwide
5 Rank
6 Peak
7 Rank
8 Peak
9 Budget
10 Ref(s)


In [6]:
# Keep only the third (most specific) level of each column label
new_col_names = [col[2] for col in box_office.columns]

box_office.columns = new_col_names
display(box_office)

Unnamed: 0,Film,U.S. release date,U.S. and Canada,Other territories,Worldwide,Rank,Peak,Rank.1,Peak.1,Budget,Ref(s)
0,The Fellowship of the Ring,19 December 2001,"$319,372,078","$568,468,333","$887,840,411",91.0,9.0,77.0,5.0,$93 million,[64][65][66]
1,The Two Towers,18 December 2002,"$345,518,923","$592,392,292","$937,911,215",72.0,7.0,71.0,4.0,$94 million,[67][68][69]
2,The Return of the King,17 December 2003,"$381,878,219","$756,308,664","$1,138,186,883",51.0,6.0,31.0,2.0,$94 million,[70][71][72]
3,Total,Total,"$1,046,769,220","$1,917,169,289","$2,963,938,509",,,,,$281 million,[note 1]
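
Keeping only the last level of a multi-level header can leave two columns with the same name, such as `Rank` and `Peak` here. One possible alternative, sketched below on an invented miniature table (the values and the table itself are made up for illustration), is to join the distinct labels of each level so every column name stays unique.

```python
import pandas as pd

# Invented miniature of a table with a two-level header
columns = pd.MultiIndex.from_tuples([
    ("Film", "Film"),
    ("U.S. and Canada", "Rank"),
    ("Worldwide", "Rank"),
])
example = pd.DataFrame([["Film A", 1, 2]], columns=columns)

# Option 1: keep only the last level (gives two columns both named "Rank")
last_level = list(example.columns.get_level_values(-1))

# Option 2: join the distinct labels in each tuple so every name stays unique
joined = [" ".join(dict.fromkeys(col)) for col in example.columns]

example.columns = joined
print(last_level)
print(joined)
```

`dict.fromkeys()` is used here only to drop repeated labels such as `("Film", "Film")` while keeping their order.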


In [7]:
lotr_soundtrack = tables[3]
display(lotr_soundtrack)

Unnamed: 0,Title,U.S. release date,Length,Composer,Label
0,The Fellowship of the Ring: Original Motion Pi...,20 November 2001,71:29,Howard Shore,Reprise Records
1,The Two Towers: Original Motion Picture Soundt...,10 December 2002,72:46,Howard Shore,Reprise Records
2,The Return of the King: Original Motion Pictur...,25 November 2003,72:05,Howard Shore,Reprise Records


### Web Scraping with BeautifulSoup

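`BeautifulSoup` is a package that simplifies the process of web scraping. It parses the HTML into a tree of tags that we can search, which gives finer control than `pd.read_html()` when we want specific rows or cells.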

In [12]:
!pip install bs4
!pip install html5lib

Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Collecting webencodings (from html5lib)
  Downloading webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1


In [15]:
url = "https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)"

# Set headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

# Fetch the page using requests with headers
response = requests.get(url, headers=headers)
data = response.text
print(data)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>The Lord of the Rings (film series) - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-fo

In [19]:
from bs4 import BeautifulSoup

# Name the parser explicitly; html5lib was installed above
soup = BeautifulSoup(data, "html5lib")

In [22]:
table = soup.find('table')  # the first <table> on the page (the infobox)

for row in table.find_all('tr'):  # in html a table row is represented by tag <tr>
    label = row.find('th')  # in the infobox, the row label is a header cell <th>
    value = row.find('td')  # in html a data cell is represented by tag <td>
    if label is None or value is None:
        continue  # skip rows without both cells; indexing them raises IndexError
    # Print the label and value of each remaining row
    print("{}--->{}".format(label.get_text(" ", strip=True), value.get_text(" ", strip=True)))
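
Rows of an HTML table do not all contain the same cells: header rows use `<th>` rather than `<td>`, so indexing into the list of `<td>` cells without a check can raise an `IndexError`. The sketch below shows one way to guard against that on an invented table (the HTML and its contents are made up for illustration). It uses Python's built-in `html.parser`, so only `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented HTML table with a header row (<th>) and two data rows (<td>)
html = """
<table>
  <tr><th>Film</th><th>Year</th></tr>
  <tr><td>Film A</td><td>2001</td></tr>
  <tr><td>Film B</td><td>2002</td></tr>
</table>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row has <th> cells only, so skip it
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

The collected `rows` list can then be passed to `pd.DataFrame()` along with column names of your choice.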