# Introduction
What's the last school shooting you remember? If you're on top of your news, then you'll know it was the Pleasantville, New Jersey shooting, a few weeks ago (from the time this was written). 
Do you remember the details, though?
Probably not.

What about the shooting the day right before Pleasantville? Or the two in October?
The problem is clear: school shootings are a regular occurrence and it's hard to keep track of them all, let alone the details of each one.
In fact, according to Wikipedia, there were 64 US school shootings from 2000-2009, 87 from 2010-2014, and 113 from 2015-November 2019. 
If these numbers seem lower than you expected, remember that these are ONLY school shootings, not mass shootings. The numbers there are staggering: 2,138 since 2013, roughly one a day. The number changes a bit depending on how you define a mass shooting, but either way the point remains: the number of shootings that occur in the U.S. is unacceptably high.

For this project we want to focus on school shootings, because the term is much more clearly defined and the events can be analyzed more easily over a smaller timeframe. Our goal is to find causes of the spike in shootings, especially in the past decade.
It's important to note that none of us want to push any sort of policy or idea or agenda here: we want to stay as neutral as possible.


# References
https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States
https://en.wikipedia.org/wiki/Mass_shootings_in_the_United_States
https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States_(before_2000)#20th_century
https://en.wikipedia.org/wiki/List_of_unsuccessful_attacks_related_to_schools


## Beginning our Analysis
Our goal is to identify significant school shootings by collecting compiled data from Wikipedia and recording certain characteristics of each event. With those characteristics, we want to scrape various news sources, such as Fox News and CNN, and see whether the characteristics of a shooting affect how much it is reported across sources.
We recognize that, because our dataset is very small, we may not be able to draw any significant conclusions, but we believe it is a first step toward identifying these reporting patterns.

In [1]:
import pandas as pd
import requests as rq
import numpy as np
import re
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
import plotly as py
import plotly.express as px
import time

In [2]:
def makeDF(header, body):
    sol = pd.DataFrame.from_records(body)
    sol.columns = header
    return sol.dropna()

def parseWiki(url):
    school_shooting_url = rq.get(url).text
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(school_shooting_url, "html.parser")
    # sortable wikitable is what wikipedia has their tables called
    tables = soup.find_all("table",{"class":"sortable wikitable"})
    header = []
    body = []
    # wikipedia separates by year chunks so we iterate through
    for table in tables:
        for row in table.find_all("tr"):
            temp = []
            if not header:
                for h in row.find_all("th"):
                    header.append(h.get_text().rstrip())
            for col in row.find_all("td"):
                cur = col.get_text()
                # want to get rid of wiki references; [^\]]* stops at the first
                # closing bracket, so text between two references is kept
                cur = re.sub(r'\[[^\]]*\]', '', cur)
                cur = cur.replace("\n", " ")
                noline = cur.rstrip()
                temp.append(noline)
            body.append(temp)
    # the first is the header, so we take it out
    return header, body[1:]

header, body = parseWiki("https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States")
df = makeDF(header, body)
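
When stripping footnote markers from a table cell, the choice between a greedy and a bracket-bounded pattern matters as soon as a cell holds more than one marker. The following is a minimal sketch on a made-up sample string; `strip_citations` is a hypothetical helper name, not part of the notebook.

```python
import re

def strip_citations(text):
    """Remove bracketed footnote markers such as [3] from a string."""
    # [^\]]* matches up to the first closing bracket only,
    # so each marker is removed on its own
    return re.sub(r"\[[^\]]*\]", "", text).strip()

# made-up sample with two markers in one cell
sample = "Flint, Michigan[3] and nearby[4]"

# a greedy pattern runs from the first "[" to the last "]"
greedy = re.sub(r"\[.*\]", "", sample)

print(repr(greedy))                   # 'Flint, Michigan'
print(repr(strip_citations(sample)))  # 'Flint, Michigan and nearby'
```

The greedy version silently drops the words between the two markers, which is easy to miss in long description cells.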

In [3]:
df.head()

Unnamed: 0,Date,Location,Deaths,Injuries,Description
0,"February 29, 2000","Flint, Michigan",1,0,Shooting of Kayla Rolland: At Buell Elementary...
1,"May 26, 2000","Lake Worth, Florida",1,0,"13-year-old honor student, Nathaniel Brazill, ..."
2,"June 28, 2000","Seattle, Washington",2,0,58-year-old Director of the Division of Pathol...
3,"August 28, 2000","Fayetteville, Arkansas",2,0,"36-year-old James Easton Kelly, a PhD candidat..."
4,"September 26, 2000","New Orleans, Louisiana",0,2,"13 year-olds Darrel Johnson, and Alfred Anders..."


In [4]:
df.tail()

Unnamed: 0,Date,Location,Deaths,Injuries,Description
270,"November 15, 2019","Pleasantville, New Jersey",1,2,Five men opened fire during a playoff game bet...
271,"November 23, 2019","Union City, California",2,0,"Two boys, aged 11 and 14-years-old were shot i..."
272,"November 27, 2019","Vancouver, Washington",2,1,Tiffany Hill who was fatally shot by her estra...
273,"December 2, 2019","Waukesha, Washington",0,1,A student reported to Waukesha South High Scho...
274,"December 3, 2019","Oshkosh, Washington",0,2,A 16-year-old student at Oshkosh West High Sch...


Here we look at unsuccessful attacks in schools:

In [5]:
def parseWiki(url):
    school_shooting_url = rq.get(url).text
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(school_shooting_url, "html.parser")
    
    # It's called a wikitable sortable here, for whatever reason.
    tables = soup.find_all("table",{"class":"wikitable sortable"})
    header = []
    body = []
    # wikipedia separates by year chunks so we iterate through
    for table in tables:
        
        for row in table.find_all("tr"):
            temp = []
            if not header:
                for h in row.find_all("th"):
                    header.append(h.get_text().rstrip())
            for col in row.find_all("td"):
                cur = col.get_text()
                # want to get rid of wiki references; [^\]]* stops at the first
                # closing bracket, so text between two references is kept
                cur = re.sub(r'\[[^\]]*\]', '', cur)
                cur = cur.replace("\n", " ")
                noline = cur.rstrip()
                temp.append(noline)
            body.append(temp)
        
    # the first is the header, so we take it out
    return header, body[1:]


header2, body2 = parseWiki("https://en.wikipedia.org/wiki/List_of_unsuccessful_attacks_related_to_schools")
df2 = makeDF(header2, body2)

In [6]:
df2.head()

Unnamed: 0,Date,Location,Description
0,"October 12, 1992","Lincoln, Nebraska, United States","A 43-year-old graduate student, Arthur McElroy..."
1,"November 16, 1998","Burlington, Wisconsin, United States",Five students were arrested the morning they w...
2,May 1999,"Port Huron, Michigan, United States","A 12-year-old, 13-year-old and two 14-year-old..."
3,"January 30, 2001","Cupertino, California, United States",De Anza College student Al DeGuzman planned a ...
4,"February 14, 2001","Elmira, New York, United States","Jeremy Getman, 18, planned a school attack at ..."


In [7]:
df2.tail()

Unnamed: 0,Date,Location,Description
24,"May 1, 2014","Waseca, Minnesota, United States",17-year-old John David LaDue
25,"November 3, 2014","Newcastle upon Tyne, United Kingdom",18-year-old Liam Lyburd was arrested at his ho...
26,"December 1, 2014","Plain City, Utah, United States",A 16-year-old at Fremont High School was arres...
27,"February 14, 2018","Everett, Washington, United States","An 18-year-old, Joshua O'Connor, was arrested ..."
28,"April 17, 2019","Jefferson County, Colorado, United States","An 18-year-old, Sol Pais, flew to Colorado fro..."
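
The dates in this table do not all have the same shape: the `df2.head()` output includes `May 1999`, which has no day. A minimal sketch of a parser that tries more than one format is shown here; `parse_event_date` and the list of formats are assumptions for illustration, and month names are assumed to be in English.

```python
from datetime import datetime

# formats to try, most specific first (assumed from the rows shown above)
DATE_FORMATS = ("%B %d, %Y", "%B %Y")

def parse_event_date(text):
    """Return a datetime for the first matching format, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_event_date("October 12, 1992"))  # 1992-10-12 00:00:00
print(parse_event_date("May 1999"))          # 1999-05-01 00:00:00
print(parse_event_date("unknown"))           # None
```

A date without a day falls back to the first of the month, which is enough for per-year counts but too coarse for day-level windows.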


### Using APIs to get news sources and attempt to parse data
Now that we have the Wikipedia data parsed, we want to set up the data scraping from various sources. We will first look at the Wikipedia page of the city associated with each shooting and check whether that page mentions a shooting or a shooter.
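
The two steps involved, building the city's page URL from the location string and testing the page text for a mention, can be sketched without any network access. The names `city_url` and `mentions_shooting` are hypothetical, and the case-insensitive pattern is an assumption rather than the exact match used in this notebook.

```python
import re
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"
# matches "shooting", "shootings", "shooter", "shooters", ignoring case
MENTION = re.compile(r"\bshoot(?:ing|er)s?\b", re.IGNORECASE)

def city_url(location):
    """Turn a location such as 'Flint, Michigan' into an article URL."""
    title = location.strip().replace(" ", "_")
    # keep commas and slashes readable, escape everything else
    return WIKI_BASE + quote(title, safe="/,")

def mentions_shooting(page_text):
    """True if the text contains one of the target words."""
    return MENTION.search(page_text) is not None

print(city_url("Flint, Michigan"))
# https://en.wikipedia.org/wiki/Flint,_Michigan
print(mentions_shooting("The Shooting drew national attention."))  # True
print(mentions_shooting("A quiet town on the river."))             # False
```

A match only shows that the page mentions some shooting, not necessarily the event in the table, so a `True` value should be read as a rough signal.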

In [8]:
# df: the dataframe that has date, location, injuries, deaths, description of shooting
def trace(df):
    city_mentions = []
    for i, row in df.iterrows():
        # rows whose date does not parse are marked "N/A" unless a mention is found;
        # the flag is reset on every row so it cannot leak into the next one
        try:
            datetime.strptime(row['Date'], '%B %d, %Y')
            ignored = False
        except (ValueError, TypeError):
            ignored = True

        loc = (row["Location"]).rstrip().replace(" ", "_")
        url = "https://en.wikipedia.org/wiki/" + loc
        city_url = rq.get(url).text
        soup = BeautifulSoup(city_url, "html.parser")
        # fall back to the whole document if the page has no <body>
        page = soup.body if soup.body is not None else soup
        # seeing if shooting is mentioned
        shooting_results = page.find(string = re.compile(".*shooting.*"))
        # seeing if shooter is mentioned
        shooter_results = page.find(string = re.compile(".*shooter.*"))
        if shooting_results or shooter_results:
            city_mentions.append(True)
        elif ignored:
            city_mentions.append("N/A")
        else:
            city_mentions.append(False)
    df["City Mentioned"] = city_mentions
    return df
df3 = trace(makeDF(header, body))

In [9]:
df3.head()

Unnamed: 0,Date,Location,Deaths,Injuries,Description,City Mentioned
0,"February 29, 2000","Flint, Michigan",1,0,Shooting of Kayla Rolland: At Buell Elementary...,True
1,"May 26, 2000","Lake Worth, Florida",1,0,"13-year-old honor student, Nathaniel Brazill, ...",False
2,"June 28, 2000","Seattle, Washington",2,0,58-year-old Director of the Division of Pathol...,False
3,"August 28, 2000","Fayetteville, Arkansas",2,0,"36-year-old James Easton Kelly, a PhD candidat...",False
4,"September 26, 2000","New Orleans, Louisiana",0,2,"13 year-olds Darrel Johnson, and Alfred Anders...",True


In [10]:
df3.tail()

Unnamed: 0,Date,Location,Deaths,Injuries,Description,City Mentioned
270,"November 15, 2019","Pleasantville, New Jersey",1,2,Five men opened fire during a playoff game bet...,False
271,"November 23, 2019","Union City, California",2,0,"Two boys, aged 11 and 14-years-old were shot i...",False
272,"November 27, 2019","Vancouver, Washington",2,1,Tiffany Hill who was fatally shot by her estra...,False
273,"December 2, 2019","Waukesha, Washington",0,1,A student reported to Waukesha South High Scho...,False
274,"December 3, 2019","Oshkosh, Washington",0,2,A 16-year-old student at Oshkosh West High Sch...,False


In [12]:
def parse_city(url):
    city_pop = ""
    city_url = rq.get(url).text
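    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(city_url, "html.parser")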
    texts = soup.find_all("tr")
    
    for link in texts:
        if link.get("class") is not None and link.get("class")[0] == "mergedrow"\
        and link.th is not None and link.th.string != "Settled" and link.th.string != "Incorporated":
            if link.a is not None and link.a.string is not None:
                if link.td is not None and link.td.string is not None\
                and link.a.get("title") != "ZIP Code" and link.a.get("title") != "Telephone numbering plan"\
                and link.a.get("title") != "GNIS":  
                    res = link.td.string.split(",")
                    go = True
                    for index in res:
                        try:
                            int(index)
                        except ValueError:
                            go = False
                            break
                    if go is True:
                        city_pop = link.td.string  
                        break
            else:
                if link.td is not None and link.td.string is not None:
                    res = link.td.string.split(",")
                    go = True
                    for index in res:
                        try:
                            int(index)
                        except ValueError:
                            go = False
                            break
                    if go is True:
                        city_pop = link.td.string  
                        break
    return city_pop

In [13]:
print(parse_city("https://en.wikipedia.org/wiki/Flint,_Michigan"))
print(parse_city("https://en.wikipedia.org/wiki/Lake_Worth_Beach,_Florida"))
print(parse_city("https://en.wikipedia.org/wiki/New_Orleans"))
print(parse_city("https://en.wikipedia.org/wiki/Riverview,_Florida"))

102,434
34,910
343,829
71,050
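
The populations come back as strings with thousands separators, so they need converting before any numeric comparison or plotting. A minimal sketch follows; `population_to_int` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def population_to_int(text):
    """Convert a string such as '102,434' to an int, or None if not numeric."""
    try:
        return int(text.replace(",", "").strip())
    except ValueError:
        # covers the empty string returned when no population was found
        return None

print(population_to_int("102,434"))  # 102434
print(population_to_int(""))         # None
```

Returning `None` for unparsed values makes it easy to drop those rows later with `dropna()`.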


In [None]:
def add_population(df):
    city_populations = []
    for i, row in df.iterrows():
        if row['Location'] is not None:
            loc = (row["Location"]).rstrip().replace(" ", "_")
            url = "https://en.wikipedia.org/wiki/" + loc
            pop = parse_city(url)
            city_populations.append(pop)
        else:
            city_populations.append('None')
    df['Population'] = city_populations
    return df

masterdf = add_population(df3)
masterdf

In [None]:
def plot_data(data):
    years = list(range(2000, 2020))
    count = [0]*len(years)
    casualties = [0]*len(years)
    for i in range(len(data)):
        row = data.iloc[i]
        # the date string ends in a four-digit year
        idx = int(row["Date"][-4:]) - 2000
        count[idx] += 1
        # accumulate casualties over the year instead of keeping only the last event
        casualties[idx] += int(row["Deaths"]) + int(row["Injuries"])
    f = {"Year": years, "Number of Occurrences": count, "Number of Deaths and Injuries": casualties}
    d = pd.DataFrame(f)
    fig = px.scatter(d, x="Year", y="Number of Occurrences", size="Number of Deaths and Injuries", title="School Shootings")
    return fig
fig = plot_data(df)
fig.update_xaxes(zeroline=False)
fig.update_yaxes(zeroline=False)
fig.show()
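
The per-year tally behind this plot can be checked on a few rows without pandas or plotly. This is a minimal sketch: `yearly_totals` is a hypothetical helper, and the three sample rows are taken from the table outputs above.

```python
from collections import defaultdict

def yearly_totals(events):
    """events: iterable of (date_string, deaths, injuries) tuples."""
    counts = defaultdict(int)
    casualties = defaultdict(int)
    for date, deaths, injuries in events:
        year = int(date[-4:])  # date strings end in a four-digit year
        counts[year] += 1
        casualties[year] += deaths + injuries
    return dict(counts), dict(casualties)

sample = [
    ("February 29, 2000", 1, 0),
    ("May 26, 2000", 1, 0),
    ("November 15, 2019", 1, 2),
]
counts, casualties = yearly_totals(sample)
print(counts)      # {2000: 2, 2019: 1}
print(casualties)  # {2000: 2, 2019: 3}
```

Keying by the year itself, rather than by a list offset, avoids index errors if the table later gains rows outside 2000-2019.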

In [None]:
def plot_data2(data, bad=True):
    years = list(range(1992, 2020))
    count = [0]*len(years)
    for i in range(len(data)):
        row = data.iloc[i]
        # the date string ends in a four-digit year; 1992 is the first year in the table
        count[int(row["Date"][-4:]) - 1992] += 1
    f = {"Year": years, "Number of Occurrences": count}
    d = pd.DataFrame(f)
    fig = px.scatter(d, x="Year", y="Number of Occurrences", title='Unsuccessful attacks related to schools')
    return fig
fig = plot_data2(df2)
fig.update_xaxes(zeroline=False)
fig.update_yaxes(zeroline=False)
fig.show()

In [None]:
def plot_results(data):
    # Assumption: the intended plot compares city population with casualties per event.
    rows = []
    for i in range(len(data)):
        row = data.iloc[i]
        pop = str(row["Population"]).replace(",", "")
        # skip rows where no numeric population was scraped
        if not pop.isdigit():
            continue
        rows.append({"Population": int(pop),
                     "Number of Deaths and Injuries": int(row["Deaths"]) + int(row["Injuries"])})
    d = pd.DataFrame(rows, columns=["Population", "Number of Deaths and Injuries"])
    fig = px.scatter(d, x="Population", y="Number of Deaths and Injuries", title="School Shootings by City Population")
    return fig
fig = plot_results(masterdf)
fig.show()

# Scraping New York Times Articles

Now that we have significant numerical data on these events, let's gauge the impact these events have had on society. We measure societal significance by media coverage of the event, primarily the number of articles published about it: we treat a larger number of articles as a sign of larger societal impact and a smaller number as a sign of smaller impact. To capture the initial and immediate impact, we only look at articles published within 30 days of the event date.
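
The search window can be worked out with the standard library alone. The following is a minimal sketch, assuming a 30-day window and dates written like `February 29, 2000`; `search_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def search_window(event_date, days=30):
    """Return (begin, end) as YYYYMMDD integers for a date like 'May 26, 2000'."""
    start = datetime.strptime(event_date, "%B %d, %Y")
    end = start + timedelta(days=days)
    # strftime zero-pads month and day, so no manual padding is needed
    return int(start.strftime("%Y%m%d")), int(end.strftime("%Y%m%d"))

print(search_window("February 29, 2000"))  # (20000229, 20000330)
print(search_window("December 3, 2019"))   # (20191203, 20200102)
```

Using `timedelta` handles month and year rollovers, as the second call shows.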

In [None]:
def format_date(date):
    # format datetime date to be a YYYYMMDD integer;
    # strftime zero-pads the month and day
    return int(date.strftime('%Y%m%d'))

def count_articles(location,event_date):
    start_date = datetime.strptime(event_date,'%B %d, %Y')
    end_date = start_date + timedelta(days=30)

    start_date = format_date(start_date)
    end_date = format_date(end_date)

    # strip the trailing newline from the key and close the file when done
    with open('nyt_api_key.txt', 'r') as key_file:
        api_key = key_file.readline().strip()
    query = '%s School Shooting'%location

    url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
    # pass the parameters separately so requests URL-encodes the query
    params = {'q': query, 'api-key': api_key, 'begin_date': start_date, 'end_date': end_date}

    request = rq.get(url, params=params)
    # fail loudly on HTTP errors instead of hitting a KeyError below
    request.raise_for_status()
    json_data = request.json()
    print(location,event_date)
    num_articles = json_data['response']['meta']['hits']
    return num_articles
    

In [None]:
def add_article_data(df):
    article_data = []
    for i, row in df.iterrows():
        location = row['Location']
        event_date = row['Date']
        num_articles = count_articles(location,event_date)
        article_data.append(num_articles)
        time.sleep(5)
    df['Number of Articles Published'] = article_data

add_article_data(df)
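
Locations contain spaces and commas, so the query string needs URL-encoding before it is sent. A minimal sketch of building the request URL with the standard library follows; `build_search_url` is a hypothetical helper and `YOUR_KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode

NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(location, begin_date, end_date, api_key):
    """Assemble an article search URL with an encoded query string."""
    params = {
        "q": "%s School Shooting" % location,
        "begin_date": begin_date,
        "end_date": end_date,
        "api-key": api_key,
    }
    # urlencode escapes commas and turns spaces into "+"
    return NYT_SEARCH + "?" + urlencode(params)

url = build_search_url("Flint, Michigan", 20000229, 20000330, "YOUR_KEY")
print(url)
```

Printing the URL before sending a request is a cheap way to confirm the dates and query look right without spending any of the API's rate limit.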
    