A final report, as a Jupyter notebook with executed cells, containing the following sections:
Introduction. What is the context of the work? What research question are you trying to answer? What are your main findings? 

Data description. This should be inspired by the format presented in Gebru et al, 2018. Answer the following questions:
    What are the observations (rows) and the attributes (columns)?
    Why was this dataset created?
    Who funded the creation of the dataset?
    What processes might have influenced what data was observed and recorded and what was not?
    What preprocessing was done, and how did the data come to be in the form that you are using?
    If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
    
Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box). 

Data analysis.
Use summary functions like mean and standard deviation along with visual displays like scatter plots and histograms to describe data.

Provide at least one model showing patterns or relationships between variables that addresses your research question. This could be a regression or clustering, or something else that measures some property of the dataset.

Evaluation of significance. Use hypothesis tests, simulation, randomization, or any other techniques we have learned to compare the patterns you observe in the dataset to simple randomness.

Conclusion. What did you find over the course of your data analysis, and how confident are you in these conclusions? Interpret these results in the wider context of the real-life application from where your data hails.

Source code. Provide a link to your GitHub repository (or other file hosting site) that has all of your project code (if applicable). For example, you might include web scraping code or data filtering and aggregation code.
    
Acknowledgments. Recognize any people or online resources that you found helpful. These can be tutorials, software packages, Stack Overflow questions, peers, and data sources. Showing gratitude is a great way to feel happier! But it also has the nice side-effect of reassuring us that you're not passing off someone else's work as your own. Crossover with other courses is permitted and encouraged, but it must be clearly stated, and it must be obvious what parts were and were not done for 2950. Copying without attribution robs you of the chance to learn, and wastes our time investigating.

In [1]:
import sys
!{sys.executable} -m pip install bs4
!{sys.executable} -m pip install lxml



In [2]:
# To Do:
#     // Data Table on Line 506 in htm file
#     get the data 
#         (year)
#         (population)
#         (all: number and rate)
#         (passenger: number and rate)
#         (pedestrian: number and rate)
#         (bicyclists: number and rate)
#     store in lists
#     create a data frame of all the data 
#     create plots (scatter plots, plots with linear regression)
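
The "store in lists" and "create a data frame" steps above can be sketched as follows. This is only an illustration under stated assumptions: the two rows are the 1975 and 1976 values that appear in the table output further down, restricted to their first four fields, and the column names and the `to_number` helper are hypothetical names chosen for this sketch.

```python
# Two rows copied from the table; only the first four fields of each are used.
rows = [
    ("1975", "215,973,199", "30,601", "14.2"),
    ("1976", "218,035,164", "31,724", "14.5"),
]

# Hypothetical column names chosen for this sketch.
column_names = ["year", "population", "passenger_number", "passenger_rate"]


def to_number(text):
    """Convert '30,601' to 30601 and '14.2' to 14.2."""
    cleaned = text.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


# zip(*rows) transposes the row tuples into one tuple per column.
columns = {
    name: [to_number(value) for value in values]
    for name, values in zip(column_names, zip(*rows))
}

print(columns["year"])
print(columns["passenger_rate"])
```

A dictionary of equal-length lists like `columns` is one of the inputs that `pandas.DataFrame` accepts, so it could be passed to it directly once pandas is imported.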

In [3]:
from bs4 import BeautifulSoup
import re

In [4]:
with open("car_crashes.htm") as file_reader:
    soup = BeautifulSoup(file_reader, "lxml")

In [5]:
title = soup.find("th").get_text()
print(title)

Number of deaths, crashes and motor vehicles in fatal crashes, 1975-2018


In [6]:
table_heads = soup.find_all("th")
table_heads

[<th class="table-title" colspan="4">Number of deaths, crashes and motor vehicles in fatal crashes, 1975-2018</th>,
 <th class="first-column" colspan="1" rowspan="1">Year</th>,
 <th class="" colspan="1" rowspan="1">Deaths</th>,
 <th class="" colspan="1" rowspan="1">Crashes</th>,
 <th class="" colspan="1" rowspan="1">Motor vehicles</th>,
 <th class="first-column" colspan="1" rowspan="1">1975</th>,
 <th class="first-column" colspan="1" rowspan="1">1976</th>,
 <th class="first-column" colspan="1" rowspan="1">1977</th>,
 <th class="first-column" colspan="1" rowspan="1">1978</th>,
 <th class="first-column" colspan="1" rowspan="1">1979</th>,
 <th class="first-column" colspan="1" rowspan="1">1980</th>,
 <th class="first-column" colspan="1" rowspan="1">1981</th>,
 <th class="first-column" colspan="1" rowspan="1">1982</th>,
 <th class="first-column" colspan="1" rowspan="1">1983</th>,
 <th class="first-column" colspan="1" rowspan="1">1984</th>,
 <th class="first-column" colspan="1" rowspan="1">1

In [35]:
print(table_heads[0])

# Convert each element of the list to a string so that the table titles can be
# located with list.index() and the text can be searched with re.
def toString(array):
    return [str(i) for i in array]

table_heads_str = toString(table_heads)
# Uncomment to confirm that the conversion succeeded:
# print(table_heads_str[0])
# type(table_heads_str[0])

# Isolate the column and row headings of the table I am focusing on: everything from
# its title up to (but not including) the title of the next table.
dataTop = table_heads_str.index('<th class="table-title" colspan="14">Motor vehicle crash deaths per 100,000 people by type, 1975-2018</th>')
dataBottom = table_heads_str.index('<th class="table-title" colspan="14">Motor vehicle crash deaths by type, 1975-2018</th>')
table_heads_str = table_heads_str[dataTop:dataBottom]
# print(table_heads_str)


# Extract the text between the opening tag and the closing </th> of each heading.
def getInfo(str_list):
    info = []
    for i in str_list:
        first = i.index('>')
        last = i.index('</th>')
        info.append(i[first+1:last])
    return info

data_titles = getInfo(table_heads_str)
print(data_titles)


    


<th class="table-title" colspan="4">Number of deaths, crashes and motor vehicles in fatal crashes, 1975-2018</th>
['Motor vehicle crash deaths per 100,000 people by type, 1975-2018', 'Year', 'Population', 'Passenger vehicle occupants', 'Pedestrians', 'Motorcyclists', 'Bicyclists', 'Large truck occupants', 'All motor vehicle deaths*', 'Number', 'Rate', 'Number', 'Rate', 'Number', 'Rate', 'Number', 'Rate', 'Number', 'Rate', 'Number', 'Rate', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018']
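
The printed list above mixes two levels of heading: the category names and the repeated Number/Rate pairs. One possible way to flatten them into a single name per column is sketched below. The category names are copied from the output; the `"category (Number)"` naming pattern is an assumption made for this sketch, not something the source table defines.

```python
# Category headings copied from the printed data_titles list.
categories = [
    "Passenger vehicle occupants", "Pedestrians", "Motorcyclists",
    "Bicyclists", "Large truck occupants", "All motor vehicle deaths*",
]
sub_headings = ["Number", "Rate"]

# Year and Population have no Number/Rate pair; every other category has both.
column_names = ["Year", "Population"]
for category in categories:
    for sub in sub_headings:
        # Hypothetical naming pattern: "Pedestrians (Rate)", etc.
        column_names.append(f"{category} ({sub})")

print(len(column_names))
print(column_names[2:4])
```

The result has 14 names, which agrees with the `colspan="14"` on the table title.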


In [99]:
table_rows = soup.find_all("tr")
table_rows_str = toString(table_rows)

# Keep only the rows between the title row of the "per 100,000 people" table and
# the title row of the table that follows it.
top = table_rows_str.index('<tr>\n<th class="table-title" colspan="14">Motor vehicle crash deaths per 100,000 people by type, 1975-2018</th>\n</tr>')  
bottom = table_rows_str.index('<tr>\n<th class="table-title" colspan="14">Motor vehicle crash deaths by type, 1975-2018</th>\n</tr>')
table_rows_str = table_rows_str[top:bottom]
# The first three rows of the slice are header rows (the table title and the column
# headings), so drop them; the 1975 row then comes first.
table_rows_str = table_rows_str[3:]
# print(table_rows_str[0])

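# Try the extraction on a single row: the year is the text between the first
# 'rowspan="1">' and the first closing tag.
tester = table_rows_str[0]
print(tester)
print(tester.find('rowspan="1">'))
print(tester.find('</t'))
a = tester.find('"1">')   # start of the '"1">' that ends the opening <th> tag
b = tester.find('</t')    # start of the closing </th>
print("here")
print(tester[a+4:b])      # skip the 4 characters of '"1">' to reach the year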


def getData(array):
    # Each cell looks like <th ... rowspan="1">1975</th> or <td ... rowspan="1">30,601</td>,
    # so capture the text between 'rowspan="1">' and the closing tag of every cell.
    info = []
    for i in array:
        info.append(re.findall(r'rowspan="1">([^<]*)</t[hd]>', i))
    return info

# row_titles = getData(table_rows_str)
# row_titles

table_rows_str
        

<tr class="odd">
<th class="first-column" colspan="1" rowspan="1">1975</th>
<td colspan="1" rowspan="1">215,973,199</td>
<td colspan="1" rowspan="1">30,601</td>
<td colspan="1" rowspan="1">14.2</td>
<td colspan="1" rowspan="1">7,516</td>
<td colspan="1" rowspan="1">3.5</td>
<td colspan="1" rowspan="1">3,180</td>
<td colspan="1" rowspan="1">1.5</td>
<td colspan="1" rowspan="1">1,003</td>
<td colspan="1" rowspan="1">0.5</td>
<td colspan="1" rowspan="1">916</td>
<td colspan="1" rowspan="1">0.4</td>
<td colspan="1" rowspan="1">44,525</td>
<td colspan="1" rowspan="1">20.6</td>
</tr>
54
70
here
1975


['<tr class="odd">\n<th class="first-column" colspan="1" rowspan="1">1975</th>\n<td colspan="1" rowspan="1">215,973,199</td>\n<td colspan="1" rowspan="1">30,601</td>\n<td colspan="1" rowspan="1">14.2</td>\n<td colspan="1" rowspan="1">7,516</td>\n<td colspan="1" rowspan="1">3.5</td>\n<td colspan="1" rowspan="1">3,180</td>\n<td colspan="1" rowspan="1">1.5</td>\n<td colspan="1" rowspan="1">1,003</td>\n<td colspan="1" rowspan="1">0.5</td>\n<td colspan="1" rowspan="1">916</td>\n<td colspan="1" rowspan="1">0.4</td>\n<td colspan="1" rowspan="1">44,525</td>\n<td colspan="1" rowspan="1">20.6</td>\n</tr>',
 '<tr class="even">\n<th class="first-column" colspan="1" rowspan="1">1976</th>\n<td colspan="1" rowspan="1">218,035,164</td>\n<td colspan="1" rowspan="1">31,724</td>\n<td colspan="1" rowspan="1">14.5</td>\n<td colspan="1" rowspan="1">7,427</td>\n<td colspan="1" rowspan="1">3.4</td>\n<td colspan="1" rowspan="1">3,306</td>\n<td colspan="1" rowspan="1">1.5</td>\n<td colspan="1" rowspan="1">914</
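
Each row string above still holds its values as text with thousands separators. The sketch below shows one way to turn a row into numbers and then check how the Rate columns relate to the Number columns. It assumes that a rate is the number of deaths divided by the population, times 100,000, rounded to one decimal place, as the table title suggests. `parse_row` is a hypothetical helper written for this sketch, and it uses a regular expression rather than BeautifulSoup to stay self-contained.

```python
import re

# The 1975 row exactly as it appears in the cell output above.
row_1975 = (
    '<tr class="odd">\n'
    '<th class="first-column" colspan="1" rowspan="1">1975</th>\n'
    '<td colspan="1" rowspan="1">215,973,199</td>\n'
    '<td colspan="1" rowspan="1">30,601</td>\n'
    '<td colspan="1" rowspan="1">14.2</td>\n'
    '<td colspan="1" rowspan="1">7,516</td>\n'
    '<td colspan="1" rowspan="1">3.5</td>\n'
    '<td colspan="1" rowspan="1">3,180</td>\n'
    '<td colspan="1" rowspan="1">1.5</td>\n'
    '<td colspan="1" rowspan="1">1,003</td>\n'
    '<td colspan="1" rowspan="1">0.5</td>\n'
    '<td colspan="1" rowspan="1">916</td>\n'
    '<td colspan="1" rowspan="1">0.4</td>\n'
    '<td colspan="1" rowspan="1">44,525</td>\n'
    '<td colspan="1" rowspan="1">20.6</td>\n'
    '</tr>'
)


def parse_row(row_html):
    """Return the cell values of one row as numbers (hypothetical helper)."""
    # Capture the text that sits directly before each closing </th> or </td>.
    cells = re.findall(r">([^<>]+)</t[hd]>", row_html)
    values = []
    for cell in cells:
        cleaned = cell.replace(",", "")  # drop thousands separators
        values.append(float(cleaned) if "." in cleaned else int(cleaned))
    return values


values = parse_row(row_1975)
year, population = values[0], values[1]
numbers = values[2::2]  # death counts for each category
rates = values[3::2]    # deaths per 100,000 people for each category

for number, rate in zip(numbers, rates):
    recomputed = round(number / population * 100_000, 1)
    print(number, rate, recomputed)
```

For the 1975 row, every recomputed rate agrees with the published one, which supports reading the Rate columns as deaths per 100,000 people.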

In [84]:

# table_heads_str = str(table_heads)
# re.search('Motor vehicle crash deaths per 100,000 people by type', table_heads_str)
# pos1 = table_heads_str.index('Motor vehicle crash deaths per 100,000 people by type')
# # len(table_heads)
# # len(table_heads_str)
# posEnd = table_heads_str.index("Motor vehicle crash deaths by type, 1975-2018</th>")
# new_table_heads = table_heads_str[pos1:posEnd]
# new_table_heads
# re.findall(, new_table_heads)


# column_titles = table_heads[1:5]
# column_titles

In [None]:
# TODO: choose the slice of table_heads that holds the row titles.
# row_titles = table_heads[]

In [None]:
# with open("car_crashes.htm") as file_reader:
#     soup = BeautifulSoup(file_reader, "lxml")