# ds-gdi-crime-article

The goal of this project is to extract descriptions of victims, alleged perpetrators, and officers from about 7,000 print news articles. Once the descriptions are extracted, we perform exploratory analysis on the data to determine the types of victims, perpetrators, officers, and crimes that the articles describe.

## First Sandwich

Make sure the collection of news articles named `data.csv` is in the same folder as this notebook.

a) Read in `data.csv` as a DataFrame.

In [3]:
import pandas as pd

df = pd.read_csv('data.csv')
df.describe()

Unnamed: 0,id,publish date
count,6661.0,0.0
mean,3331.0,
std,1923.009404,
min,1.0,
25%,1666.0,
50%,3331.0,
75%,4996.0,
max,6661.0,


b) Drop any columns that you don't need for the web scraping.

In [4]:
df = df.drop(columns=['station', 'date segment', 'publish date'])
df.head()

Unnamed: 0,id,link
0,1,https://abc7.com/5434013
1,2,https://abc7.com/5433836
2,3,https://abc7.com/5433746
3,4,https://abc7.com/5433587
4,5,https://abc7.com/5432778


c) Iterate over each row so that we can feed each link into a web scraper later.

In [5]:
dataset = []
# iterate over the rows with iterrows() and collect each link by column name
for _, row in df.iterrows():
    dataset.append(row['link'])

print(len(dataset))
print(dataset)

6661


d) Recall that the articles come from a total of three different sources: KABC, WABC, and WLS. Try parsing the title, date, and article text of a KABC article and put them into a dict.

In [6]:
import requests

res = {}

# our KABC testing url
url = "https://abc7.com/5434013"

response = requests.get(url)
# print(response.text)

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

article_title = soup.find("h1", class_="headline")
res["title"] = article_title.text.strip()
print(article_title.text.strip())
print()

if (soup.find('div', class_="report-typo")) is not None:
    soup.find('div', class_="report-typo").decompose()

from dateutil import parser
time = soup.find("meta", property="article:modified_time")

if time is not None:
    yourdate = parser.parse(time["content"])
    res["date"] = yourdate.strftime('%Y-%m-%d')
    print(yourdate.strftime('%Y-%m-%d'))
    # print(time["content"] if time else "No meta title given")
    print()

article_text = soup.find("div", class_="body-text")
res["text"] = article_text.text.strip()
# print(article_text.prettify())
print(article_text.text.strip())
print()

tags = soup.find("div", class_="story-taxonomy")

# collect the text of every tag button; skip articles with no taxonomy block
tags_text = []
if tags is not None:
    for button in tags.find_all("a", class_="button"):
        tags_text.append(button.text.strip())
res["tags"] = tags_text
print(tags_text)
print()

print(res)

Transient sentenced to 6 years in attack that preceded Oxnard grandmother's death

2019-08-01

OXNARD, Calif. (KABC) -- A transient who smiled before attacking a 71-year-old grandmother and an 80-year-old man was sentenced to six years in prison Wednesday.The grandmother, Armida Castro, died a week after she was hospitalized for her injuries.In an emotional hearing in Ventura County Superior Court, Castro's family asked for the strongest penalty for 56-year-old defendant Adam Barcenas."We are hoping that you can see how heinous his act was, beating and kicking a helpless ailing woman. This was not merely abuse. This was an evil cowardly act," said Castro's son-in-law, Jose Alejandro Navarette.Castro's family can't understand why Barcenas was not tried for homicide.According to the medical examiner Castro died eight days after the attack because of a blood clot that was caused by her injuries. But the opinion was not conclusive. The Ventura County District Attorney says Castro had pre-e
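
In the printed text above, sentences from adjacent paragraphs run together (for example, `Wednesday.The grandmother`), most likely because `.text` joins the contents of each element with no separator. One possible remedy, sketched here on a small made-up HTML snippet rather than a real article page, is BeautifulSoup's `get_text` with a `separator` argument:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real article pages are more complex.
html = '<div class="body-text"><p>First paragraph.</p><p>Second paragraph.</p></div>'
soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="body-text")

# .text concatenates the paragraphs with nothing in between
fused = body.text.strip()
# get_text can insert a separator between text fragments and strip each one
spaced = body.get_text(separator=" ", strip=True)

print(fused)
print(spaced)
```

Keeping sentence boundaries intact may matter later, since NER models generally rely on them.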

e) Now try parsing the title, date, and article text of a WABC article and put them into a dict.

In [7]:
res = {}

# our WABC testing url
url = "https://abc7ny.com/$1m-worth-of-liquid-meth-found-hidden-in-snow-globes/5418660/"
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

article_title = soup.find("h1", class_="headline")
res["title"] = article_title.text.strip()
print(article_title.text.strip())
print()

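# remove the "report a typo" widget if present so it does not end up in the text
report_typo = soup.find('div', class_="report-typo")
if report_typo is not None:
    report_typo.decompose()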

time = soup.find("meta", property="article:modified_time")
yourdate = parser.parse(time["content"])
res["date"] = yourdate.strftime('%Y-%m-%d')
print(yourdate.strftime('%Y-%m-%d'))
print()

# print(url.split('/'))
# print()

article_text = soup.find("div", class_="body-text")
res["text"] = article_text.text.strip()
print(article_text.text.strip())
print()

print(res)

$1M worth of liquid meth found hidden in snow globes at Australian border

2019-07-25




f) Now try parsing the title, date, and article text of a WLS article and put them into a dict.

In [8]:
res = {}

# our WLS testing url
url = "https://abc7chicago.com/5433655"
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

article_title = soup.find("h1", class_="headline")
res["title"] = article_title.text.strip()
print(article_title.text.strip())
print()

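# remove the "report a typo" widget if present so it does not end up in the text
report_typo = soup.find('div', class_="report-typo")
if report_typo is not None:
    report_typo.decompose()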

time = soup.find("meta", property="article:modified_time")
yourdate = parser.parse(time["content"])
res["date"] = yourdate.strftime('%Y-%m-%d')
print(yourdate.strftime('%Y-%m-%d'))
print()

article_text = soup.find("div", class_="body-text")
res["text"] = article_text.text.strip()
print(article_text.text.strip())
print()

print(res)

Police investigating Maywood double murder; 1 victim recently lost stepson to gun violence

2019-07-31

MAYWOOD, Ill. (WLS) -- One of the two people killed in a double murder in west suburban Maywood lost his stepson to violence just four months ago.Police said just before 6:30 p.m. Tuesday officers responded to the 1200-block of 13th Avenue for reports of shots fired in the area. Officers found Yarnell M. White and Dean Stansberry in need of medical attention. They were taken to Loyola Medical Center, both in critical condition, where they died.Stansberry was still grieving his stepson's death when he was killed. Just last week, Father Michael Pfleger was with Stansberry's family. His wife stood with the priest pleading for leads in the murder of her son Isiah Scott.Scott was 19 when he was killed two months before his high school graduation.Stansberry and his wife have the support of the organization Purpose Over Pain, which offered a $5,000 reward for information leading to the arre

g) Now that we can successfully parse an article from each source, create a function that can parse an article and return it as a dataframe.

In [9]:
def parse_article_url(url):
    """Fetch one article and return its fields as a one-row DataFrame."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find("h1", class_="headline")
    title_text = title.text.strip() if title is not None else ''

    # remove the "report a typo" widget if present so it does not end up in the text
    report_typo = soup.find('div', class_="report-typo")
    if report_typo is not None:
        report_typo.decompose()

    # normalize the modified-time metadata to YYYY-MM-DD; leave blank if missing
    time_meta = soup.find("meta", property="article:modified_time")
    if time_meta is not None:
        date_text = parser.parse(time_meta["content"]).strftime('%Y-%m-%d')
    else:
        date_text = ''

    # infer the station from the URL's host name
    host = url.split('/')[2]
    if host == 'abc7.com':
        station = 'KABC'
    elif host == 'abc7ny.com':
        station = 'WABC'
    elif host == 'abc7chicago.com':
        station = 'WLS'
    else:
        station = ''

    article = soup.find("div", class_="body-text")
    article_text = article.text.strip() if article is not None else ''

    # collect the text of every tag button; [""] marks an article with no taxonomy block
    tags = soup.find("div", class_="story-taxonomy")
    if tags is not None:
        tags_text = [a.text.strip() for a in tags.find_all("a", class_="button")]
    else:
        tags_text = [""]

    # DataFrame.append was removed in pandas 2.0, so build the one-row frame directly.
    # Note: len(article_text) counts characters, not words.
    return pd.DataFrame([{
        "Title": title_text, "Date": date_text, "Station": station,
        "Text": article_text, "Tags": tags_text, "Word Count": len(article_text),
    }])

df = parse_article_url("https://abc7.com/5434013")
df.head()

Unnamed: 0,Title,Date,Station,Text,Tags,Word Count
0,Transient sentenced to 6 years in attack that ...,2019-08-01,KABC,"OXNARD, Calif. (KABC) -- A transient who smile...","[oxnard, ventura county, elderly woman, senten...",1893.0
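
The station lookup in the function above depends on the host part of the URL. As an alternative sketch, the standard library's `urllib.parse.urlparse` can extract the host, and a dict can hold the mapping. The names `STATION_BY_HOST` and `station_from_url` are hypothetical, and the handling of a `www.` prefix is an assumption, since none of the sample URLs has one:

```python
from urllib.parse import urlparse

# Host-to-station mapping taken from the three domains used in this notebook.
STATION_BY_HOST = {
    "abc7.com": "KABC",
    "abc7ny.com": "WABC",
    "abc7chicago.com": "WLS",
}

def station_from_url(url):
    """Return the station call sign for a URL, or '' if the host is unknown."""
    host = urlparse(url).netloc.lower()
    # Tolerate an optional "www." prefix (an assumption; the sample URLs have none).
    if host.startswith("www."):
        host = host[len("www."):]
    return STATION_BY_HOST.get(host, "")

print(station_from_url("https://abc7chicago.com/5433655"))
```

A dict keeps the mapping in one place, so adding a fourth source would be a one-line change.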


h) For deliverable 1, parse each of the sample article URLs, concatenate the results into one large DataFrame, and write it to a CSV that we can show to the client.

In [10]:
sample_urls = [
    "https://abc7.com/5434013",
    "https://abc7.com/5433836",
    "https://abc7.com/5433746",
    "https://abc7.com/5433587",
    "https://abc7.com/5432778",
    "https://abc7ny.com/$1m-worth-of-liquid-meth-found-hidden-in-snow-globes/5418660/",
    "https://abc7ny.com/10-year-old-severely-bitten-while-riding-home-on-school-bus/5292384/",
    "https://abc7ny.com/11-men-in-custody-1-at-large-after-drug-bust-in-mount-vernon/5244403/",
    "https://abc7ny.com/12-year-old-critically-injured-in-nj-hit-and-run-crash/5404024/",
    "https://abc7ny.com/13-philly-cops-to-be-fired-after-facebook-post-investigation/5403870/",
    "https://abc7chicago.com/5433655",
    "https://abc7chicago.com/5433635",
    "https://abc7chicago.com/5432374",
    "https://abc7chicago.com/5432198",
    "https://abc7chicago.com/5432152",
]

# parse each URL and concatenate the one-row frames in a single call
df = pd.concat([parse_article_url(url) for url in sample_urls], ignore_index=True)

# print(df.head())
df.to_csv('sample_parse_with_tags.csv')

i) For deliverable 4, we need to go back here and add word count to the dataframe. Let's see if that's possible.

In [10]:
# see part (g)
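
One caveat: the `Word Count` column in part (g) is computed with `len(article_text)`, which counts characters rather than words. If a true word count is wanted, a minimal sketch is to split on whitespace. The helper name `word_count` and the sample sentence are illustrative only:

```python
def word_count(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())

# Made-up sentence for illustration.
sample = "A transient was sentenced to six years in prison Wednesday."

print(len(sample))         # number of characters
print(word_count(sample))  # number of whitespace-separated words
```

Splitting on whitespace treats punctuation attached to a word as part of that word, which is usually acceptable for a rough length measure.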

## Second Sandwich

a) For deliverable 2, let's start by parsing all of the URLs in the dataset now that our parser is stable and approved.

In [12]:
# parse every URL and concatenate the one-row frames in a single call
df = pd.concat([parse_article_url(url) for url in dataset], ignore_index=True)

df.to_csv('dataset_parse.csv')
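
A run over several thousand URLs can be interrupted by a single failed request. The following is a sketch of one way to keep going and record failures, not what the notebook actually ran. The names `parse_all`, `fake_parse`, and `delay_seconds` are hypothetical, and the stand-in parser replaces `parse_article_url` so the example runs without network access:

```python
import time

def parse_all(urls, parse_fn, delay_seconds=0.0):
    """Apply parse_fn to each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parse_fn(url))
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            failures.append((url, repr(exc)))
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite to the server
    return results, failures

# Stand-in parser for demonstration; in the notebook this would be parse_article_url.
def fake_parse(url):
    if "bad" in url:
        raise ValueError("could not parse")
    return {"url": url}

results, failures = parse_all(
    ["https://example.com/1", "https://example.com/bad"], fake_parse
)
print(len(results), len(failures))
```

With the notebook's parser, `results` would be a list of one-row DataFrames that `pd.concat(results, ignore_index=True)` can combine, and `failures` lists the URLs worth retrying.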

Now that we have our data, let's clean it up.

b) Remove all rows with no article text (these are most likely video articles with no text).

In [13]:
import pandas as pd

df = pd.read_csv('dataset_parse.csv')
print(len(df))

# read_csv loads empty Text cells as NaN, so notnull() keeps only articles with text
df = df[df.Text.notnull()]
# drop the first column, the unnamed index that to_csv wrote out
df = df.iloc[:, 1:]
print(len(df))
df = df.reset_index(drop=True)
df.to_csv('dataset_parse_drop.csv')
df.head()

6661
6305


Unnamed: 0,Title,Date,Station,Text,Tags,Word Count
0,Transient sentenced to 6 years in attack that ...,2019-08-01,KABC,"OXNARD, Calif. (KABC) -- A transient who smile...","['oxnard', 'ventura county', 'elderly woman', ...",1893.0
1,Monrovia kidnapping: Search for missing woman ...,2019-08-01,KABC,"MONROVIA, Calif. (KABC) -- Authorities are ask...","['monrovia', 'los angeles county', 'downtown l...",2739.0
2,"LAPD dashcam, bodycam videos show arrest of su...",2019-07-31,KABC,SOUTH LOS ANGELES (KABC) -- The Los Angeles Po...,"['south los angeles', 'los angeles county', 'l...",1320.0
3,Southern California's most wanted: FBI focusin...,2019-08-01,KABC,LOS ANGELES (KABC) -- You could call them the ...,"['los angeles', 'los angeles county', 'souther...",2322.0
4,Indian coffee shop chain owner V.G.Siddhartha'...,2019-08-01,KABC,"BANGALORE, India -- Fishermen on Wednesday fou...","['u.s. & world', 'coffee', 'death investigation']",1728.0
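
Note that the `Tags` column now prints with quotes around each tag: after the round trip through `dataset_parse.csv`, each cell is a string that looks like a list rather than an actual list. If the tags are needed as lists again, one option is `ast.literal_eval` from the standard library. The helper name `parse_tags` is hypothetical:

```python
import ast

def parse_tags(cell):
    """Turn a stringified list such as "['a', 'b']" back into a Python list."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing (NaN) or empty cell
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple)) else [value]

print(parse_tags("['u.s. & world', 'coffee', 'death investigation']"))
```

In the notebook this could be applied with `df['Tags'] = df['Tags'].apply(parse_tags)`.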


c) Next, let's run NER (Named Entity Recognition) with spaCy on the article headlines to see what initial information it can gather, and then we'll go from there.

In [None]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:
headlines = ["Transient sentenced to 6 years in attack that preceded Oxnard grandmother's death",
             "Monrovia kidnapping: Search for missing woman focuses on Mount Baldy area",
             "LAPD dashcam, bodycam videos show arrest of suspect allegedly armed with machete in South Los Angeles",
             "Southern California's most wanted: FBI focusing on tracking down dozen dangerous fugitives",
             "Indian coffee shop chain owner V.G.Siddhartha's body found in river by fisherman",
             "Riverside possible abduction: Sisters, ages 18 months and 8 months, possibly taken by mother's boyfriend, police say",
             "Man arrested for second attempted kidnapping in San Jacinto",
             "Funeral set for off-duty LAPD officer killed in Lincoln Heights",
             "Simi Valley police seeking suspect who exposed himself to hotel housekeeper",
             "Long Island woman charged in murder-for-hire plot against ex-husband's mother, 5-year-old daughter",
             "Monrovia kidnapping suspect in custody after hourslong standoff with SWAT team in downtown Los Angeles",
             ]

docs = [nlp(headline) for headline in headlines]

for doc in docs:
    print([(ent.text, ent.label_) for ent in doc.ents])
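
Each printed list holds `(text, label)` pairs. To see which kinds of entities appear, it can help to group the pairs by label. The sketch below uses only the standard library; the function name `group_by_label` is hypothetical, and the example pairs are hand-written for illustration, not actual model output:

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hand-written pairs for illustration; these are not actual model output.
example = [("Oxnard", "GPE"), ("6 years", "DATE"), ("Los Angeles", "GPE")]
print(group_by_label(example))
```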

In [None]:
# run NER on every headline and keep (entity text, label) pairs per article
ner_titles = []
for index, row in df.iterrows():
    ner_titles.append([(ent.text, ent.label_) for ent in nlp(row['Title']).ents])

# write one article's entity list per line
with open('ner_titles.txt', 'w', encoding="utf-8") as f:
    for item in ner_titles:
        f.write("%s\n" % item)

In [None]:
# run NER on every article body and keep (entity text, label) pairs per article
ner_texts = []
for index, row in df.iterrows():
    ner_texts.append([(ent.text, ent.label_) for ent in nlp(row['Text']).ents])

# write one article's entity list per line

with open('ner_texts.txt', 'w', encoding="utf-8") as f:
    for item in ner_texts:
        f.write("%s\n" % item)
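
The text files above store each entity list as its Python string form, which is awkward to load back. As an alternative sketch (not what the notebook does), the lists could be written as JSON Lines, one JSON array per article. The helper names `write_jsonl` and `read_jsonl`, the file name, and the example rows are all hypothetical:

```python
import json
import os
import tempfile

def write_jsonl(path, rows):
    """Write one JSON array per line; tuples are stored as JSON arrays."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back, restoring each [text, label] array to a tuple."""
    with open(path, encoding="utf-8") as f:
        return [[tuple(pair) for pair in json.loads(line)] for line in f]

# Hand-written example rows; these are not actual model output.
rows = [[("Maywood", "GPE"), ("Tuesday", "DATE")], []]
path = os.path.join(tempfile.mkdtemp(), "ner_example.jsonl")
write_jsonl(path, rows)
restored = read_jsonl(path)
print(restored == rows)
```

An article with no entities is kept as an empty array, so line numbers still line up with rows in the DataFrame.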