## SemEval 2019 Task 4 - Extra Preprocessing Steps Exploration

Jonathan Miller and Negar Adyaniyazdi, VCU, CMSC516, Fall 2018

Goals: removing foreign-language articles, handling URLs, and named entity recognition

In [4]:
import pandas as pd

In [6]:
DATA_PATH = '../data/'
DATA_INTERIM_PATH = DATA_PATH + 'interim/'

train = pd.read_csv(DATA_INTERIM_PATH + 'train.csv')
val = pd.read_csv(DATA_INTERIM_PATH + 'val.csv')

Search the training DataFrame for a common Spanish word ('cuando')

In [43]:
foreign = train[train['article_text'].str.contains('cuando')]
foreign.reset_index(inplace=True)

In [48]:
foreign.head()

Unnamed: 0,index,id,published-at,title,hyperpartisan,bias,url,labeled-by,article_text,language
0,1097,2335,2012-05-01,Abusando la placa policial,False,left-center,http://chicagoreporter.com/abusando-la-placa-p...,publisher,Abusando la placa policial Glenn Evans observa...,spanish
1,1103,2346,2017-12-31,Marihuana legal en California: lo que hay que ...,False,least,https://apnews.com/068da65970814762b9e2275c5a7...,publisher,Marihuana legal en California: lo que hay que ...,spanish
2,1234,2636,2017-12-26,Deportistas no dudaron en ayudar tras desastre...,False,least,https://apnews.com/7ee1074b7da64d0285e0c5fdb6f...,publisher,Deportistas no dudaron en ayudar tras desastre...,spanish
3,1866,3976,2017-12-28,Putin: explosi?n en San Petersburgo fue un ata...,False,least,https://apnews.com/amp/c7d066f684ec400eb507c5b...,publisher,Putin: explosi?n en San Petersburgo fue un ata...,spanish
4,1963,4163,2018-01-25,Cilic avanza a la final del Abierto de Australia,False,least,https://apnews.com/8bb81c385c744a9ebfc48a5c94f...,publisher,Cilic avanza a la final del Abierto de Austral...,spanish


The following language-detection solution is taken from: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

In [38]:
from nltk.corpus import stopwords
from nltk import wordpunct_tokenize

In [39]:
def _calculate_languages_ratios(text):
    """
    Score the given text against every language NLTK has stopwords for and
    return a dictionary that looks like {'french': 2, 'spanish': 4, 'english': 0}
    
    @param text: Text whose language is to be detected
    @type text: str
    
    @return: Dictionary mapping each language to the number of unique stopwords seen in the text
    @rtype: dict
    """

    languages_ratios = {}

    # nltk.wordpunct_tokenize() splits all punctuation into separate tokens:
    #
    # >>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
    # ['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']

    tokens = wordpunct_tokenize(text)
    # Lowercase and de-duplicate once, outside the per-language loop
    words_set = {word.lower() for word in tokens}

    # For each language included in nltk, count the unique stopwords appearing in the text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        common_elements = words_set.intersection(stopwords_set)

        languages_ratios[language] = len(common_elements) # language "score"

    return languages_ratios

In [40]:
def detect_language(text):
    """
    Score the given text against several languages and return the
    highest-scoring one.
    
    It uses a stopword-based approach, counting how many unique stopwords
    are seen in the analyzed text. Ties go to the first such language encountered.
    
    @param text: Text whose language is to be detected
    @type text: str
    
    @return: Highest-scoring language
    @rtype: str
    """

    ratios = _calculate_languages_ratios(text)

    most_rated_language = max(ratios, key=ratios.get)

    return most_rated_language

In [45]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
foreign = foreign.assign(language=foreign['article_text'].apply(detect_language))


In [51]:
some_train = train.sample(5000)

In [52]:
some_train['language'] = some_train['article_text'].apply(detect_language)

In [53]:
some_train['language'].value_counts()

english        4961
spanish          29
azerbaijani       6
french            2
italian           1
danish            1
Name: language, dtype: int64

In [55]:
some_train[~(some_train['language'] == 'english')]

Unnamed: 0,id,published-at,title,hyperpartisan,bias,url,labeled-by,article_text,language
473358,934224,2018-01-25,FMI: Venezuela afecta promedio de crecimiento ...,False,least,https://apnews.com/amp/ee310eab95ca40879560afd...,publisher,FMI: Venezuela afecta promedio de crecimiento ...,spanish
303260,612992,2018-01-16,Abre nueva embajada de EEUU en Londres critica...,False,least,https://apnews.com/amp/2b9e6018fe6e49129c26474...,publisher,Abre nueva embajada de EEUU en Londres critica...,spanish
155296,304998,2018-01-10,Pa?ses del r?o Mekong discuten proyectos de re...,False,least,https://apnews.com/amp/2e76de49bec7440c941f32d...,publisher,Pa?ses del r?o Mekong discuten proyectos de re...,spanish
351291,728985,2018-01-25,Fiat Chrysler casi duplica sus ganancias en 2017,False,least,https://apnews.com/0f1f80e335e0498da46c6d93549...,publisher,Fiat Chrysler casi duplica sus ganancias en 20...,spanish
554272,1064703,2016-10-24,National Liberty Federation added a new photo.,True,right,http://libertyfederation.org/national-liberty-...,publisher,National Liberty Federation added a new photo....,azerbaijani
271376,546657,2018-01-22,Bucks despiden al entrenador Jason Kidd,False,least,https://apnews.com/94a45ad8f9ab45199394b1ae3ec...,publisher,Bucks despiden al entrenador Jason Kidd MILWAU...,spanish
85851,160320,2016-09-12,National Liberty Federation added a new photo.,True,right,http://libertyfederation.org/national-liberty-...,publisher,National Liberty Federation added a new photo....,azerbaijani
266825,537127,2018-01-06,Dem?cratas en los estados preparan respuestas ...,False,least,https://apnews.com/amp/acc3bd7adda046419e2da44...,publisher,Dem?cratas en los estados preparan respuestas ...,spanish
236406,473856,2018-01-31,"S&P 500, Nasdaq Futures -- Technical Analysis",True,right,http://foxbusiness.com/features/2017/04/21/s-p...,publisher,"S&P 500, Nasdaq Futures -- Technical Analysis ...",danish
102493,195085,2018-01-17,CIDH pide siga investigaci?n muerte de argenti...,False,least,https://apnews.com/d9c0dbaa1c4b4436926520659b3...,publisher,CIDH pide siga investigaci?n muerte de argenti...,spanish


In [72]:
some_train[some_train['language'] == 'italian'].reset_index().iloc[0]['article_text']

' Khi Robert Lewandowski b?t ng? ??nh ti?ng v?i c?c ??i gia Premier League, Bayern ?? quy?t ??nh chi ??m ?? mang b?ng ???c ch?n s?t n?y v? Allianz Arena. \nH?i ??u th?ng, Lewandowski tuy?n b? s? t?i Bayern nh?ng th?i gian qua, l?i b?t ng? ng? ? mu?n ???c thi ??u t?i Premier League. Tr??c th?i ?? ?m ? c?a Lewandowski, ban l?nh ??o Bayern ?? quy?t ??nh ph? k?t ?? tr?i ch?n ti?n ??o n?y. \nL??ng cao g?n b?ng Ribery \nTheo t? Daily Mail, ch?a k? ti?n th??ng, Bayern s? chi 11 tri?u euro ti?n l?t tay ??ng th?i tr? l??ng h?n 9 tri?u euro m?i n?m ?? c? ???c ch? k? c?a Lewandowski. S? ti?n n?y cao g?n g?p hai l?n m?c l??ng Lewandowski ?ang nh?n ? Dortmund. So v?i t?i Bayern, m?c l??ng n?y x?p x? c?c ng?i sao h?ng ??u nh? Franck Ribery, Philipp Lahm v? nh?nh h?n Mario Goetze kho?ng hai tri?u euro. Bayern s? k? h?p ??ng 4 n?m v?i Lewandowski v? c? ?i?u kho?n gia h?n th?m 12 th?ng. T?c ?? c? Lewandowski trong 4 m?a t?i, Bayern s? ph?i chi kho?ng g?n 50 tri?u euro, ch?a k? ti?n thu? thu nh?p. \nDo 

In [73]:
some_train[some_train['language'] == 'french'].reset_index().iloc[0]['article_text']

"Saturday's Scores Adrian Lenawee Christian 78, Tol. Christian, Ohio 59 \nAlmont 56, Imlay City 31 \nAnn Arbor Huron 75, Detroit Western International 50 \nAnn Arbor Skyline 87, Ann Arbor Pioneer 67 \nBridgeport 63, Essexville Garber 53 \nClarkston 84, Detroit Pershing 50 \nCoopersville 59, Muskegon Heights 47 \nDetroit Renaissance 75, Flint Beecher 63 \nDundee 68, Temperance Bedford 65 \nHarper Woods Chandler Park Academy 57, DCP-Northwestern 56 \nHudsonville Unity Christian 49, Holland Christian 30 \nIllinois Lutheran, Ill. 46, St. Joseph Michigan Lutheran 33 \nMount Pleasant Sacred Heart 53, Saginaw Nouvel 40 \nRoyal Oak Shrine 56, Madison Heights Bishop Foley 46 \nWarren De La Salle 47, Detroit Edison(DEPSA) 46 \nWebberville 57, Burr Oak 50 \nWest Branch Ogemaw Heights 50, Lincoln-Alcona 48 \nYpsilanti Arbor Preparatory 46, Ann Arbor Gabriel Richard 35 \nAdrian Lenawee Christian 78, Tol. Christian, Ohio 59 \nAlmont 56, Imlay City 31 \nAnn Arbor Huron 75, Detroit Western Internation

In [75]:
some_train[some_train['language'] == 'azerbaijani'].reset_index().iloc[0]['article_text']

'National Liberty Federation added a new photo.  National Liberty Federation added a new photo. \n Source \nNational Liberty Federation added a new photo.September 13, 2016In "Conservative Blogs" \nNational Liberty Federation added a new photo.September 13, 2016In "Conservative Blogs" \nNational Liberty Federation added a new photo.September 13, 2016In "Conservative Blogs"'

Most articles flagged as Spanish do appear to be Spanish. Articles flagged as other languages are either text that was not read in correctly (e.g., Vietnamese) or anomalous articles such as sports scores, lists of names, and stock prices.

The detector is useful even when the predicted language is wrong, because the articles it flags as non-English are mostly noise.
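
The misfires above are consistent with how the final `max(ratios, key=ratios.get)` call behaves: it returns the first language with the top score, so a short English text with very few stopwords can tie with another language and take that label instead. The sketch below is one possible refinement, not something this notebook does. The function name `pick_language`, the `min_score` threshold of 3, and the `'unknown'` label are all hypothetical and untuned; the input is assumed to be the dictionary returned by `_calculate_languages_ratios`.

```python
def pick_language(scores, min_score=3, default='unknown'):
    """Return the top-scoring language, or `default` when the evidence is weak or tied.

    `scores` is assumed to map language name -> count of unique stopwords seen.
    `min_score` is an illustrative, untuned threshold.
    """
    if not scores:
        return default

    # Rank languages from highest to lowest stopword count
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_language, best_score = ranked[0]

    # A tie for first place means the stopwords alone cannot separate the languages
    tied = len(ranked) > 1 and ranked[1][1] == best_score

    if best_score < min_score or tied:
        return default
    return best_language
```

Rows labeled `'unknown'` could then be inspected or dropped together with the non-English rows, since both groups appear to be mostly noise.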

Examine another sample

In [77]:
some_train = train.sample(5000, random_state=27)
some_train['language'] = some_train['article_text'].apply(detect_language)
some_train['language'].value_counts()

english        4946
spanish          40
azerbaijani       6
french            5
hungarian         2
romanian          1
Name: language, dtype: int64

In [78]:
some_train[~(some_train['language'] == 'english')]

Unnamed: 0,id,published-at,title,hyperpartisan,bias,url,labeled-by,article_text,language
739853,1369341,2016-09-13,National Liberty Federation added a new photo.,True,right,http://libertyfederation.org/national-liberty-...,publisher,National Liberty Federation added a new photo....,azerbaijani
23887,51098,2018-01-15,Aviones brit?nicos vigilan acercamiento de caz...,False,least,https://apnews.com/amp/34ad22d4ff524304883b6a5...,publisher,Aviones brit?nicos vigilan acercamiento de caz...,spanish
381844,786541,2018-01-19,Controversia en Manhattan por plan de cobrar p...,False,least,https://apnews.com/2c418e567b2649bd8b27275c3a7...,publisher,Controversia en Manhattan por plan de cobrar p...,spanish
157706,309953,2018-01-19,Cancelan viaje de artistas norcoreanas a Corea...,False,least,https://apnews.com/amp/a932e71d9cc8440c9385031...,publisher,Cancelan viaje de artistas norcoreanas a Corea...,spanish
130699,253450,2018-01-06,Friday?s Scores,False,least,https://apnews.com/a72d0f3a7e01416790d96173aa2...,publisher,"Friday?s Scores Amanda-Clearcreek 51, Baltimor...",french
188076,373090,2017-12-30,"Salah brilla de nuevo por Liverpool, Chelsea a...",False,least,https://apnews.com/96e40c2c5fd444df81bdcf38c1e...,publisher,"Salah brilla de nuevo por Liverpool, Chelsea a...",spanish
701504,1300827,2018-01-23,L?der dem?crata retira oferta de financiar mur...,False,least,https://apnews.com/c2a4c1dea292495d86ce95e8bf2...,publisher,L?der dem?crata retira oferta de financiar mur...,spanish
476741,939656,2018-01-06,Friday?s Scores,False,least,https://apnews.com/06ff74d75dd34cf9978c750fc33...,publisher,"Friday?s Scores Chapmanville 65, Logan 46 \nMi...",french
735542,1360304,2018-01-04,Islandia exige a empresas pagar igual a mujere...,False,least,https://apnews.com/806621071ab546b6b0efe64f105...,publisher,Islandia exige a empresas pagar igual a mujere...,spanish
704479,1305610,2018-01-12,Donaldson pacta por 1 a?o y 23 millones con Az...,False,least,https://apnews.com/19cedc59bdb44bc0be40becc061...,publisher,Donaldson pacta por 1 a?o y 23 millones con Az...,spanish


In [90]:
some_train[some_train['language'] == 'spanish'].reset_index()

Unnamed: 0,index,id,published-at,title,hyperpartisan,bias,url,labeled-by,article_text,language
0,23887,51098,2018-01-15,Aviones brit?nicos vigilan acercamiento de caz...,False,least,https://apnews.com/amp/34ad22d4ff524304883b6a5...,publisher,Aviones brit?nicos vigilan acercamiento de caz...,spanish
1,381844,786541,2018-01-19,Controversia en Manhattan por plan de cobrar p...,False,least,https://apnews.com/2c418e567b2649bd8b27275c3a7...,publisher,Controversia en Manhattan por plan de cobrar p...,spanish
2,157706,309953,2018-01-19,Cancelan viaje de artistas norcoreanas a Corea...,False,least,https://apnews.com/amp/a932e71d9cc8440c9385031...,publisher,Cancelan viaje de artistas norcoreanas a Corea...,spanish
3,188076,373090,2017-12-30,"Salah brilla de nuevo por Liverpool, Chelsea a...",False,least,https://apnews.com/96e40c2c5fd444df81bdcf38c1e...,publisher,"Salah brilla de nuevo por Liverpool, Chelsea a...",spanish
4,701504,1300827,2018-01-23,L?der dem?crata retira oferta de financiar mur...,False,least,https://apnews.com/c2a4c1dea292495d86ce95e8bf2...,publisher,L?der dem?crata retira oferta de financiar mur...,spanish
5,735542,1360304,2018-01-04,Islandia exige a empresas pagar igual a mujere...,False,least,https://apnews.com/806621071ab546b6b0efe64f105...,publisher,Islandia exige a empresas pagar igual a mujere...,spanish
6,704479,1305610,2018-01-12,Donaldson pacta por 1 a?o y 23 millones con Az...,False,least,https://apnews.com/19cedc59bdb44bc0be40becc061...,publisher,Donaldson pacta por 1 a?o y 23 millones con Az...,spanish
7,424311,855033,2018-01-23,Nadal se retira del Abierto de Australia por l...,False,least,https://apnews.com/dbbe49cf793943819d7aaaad06f...,publisher,Nadal se retira del Abierto de Australia por l...,spanish
8,543420,1047219,2018-01-25,Corte UE veta pruebas de sexualidad a solicita...,False,least,https://apnews.com/bf8a368f0b634c178d3ebd04e1e...,publisher,Corte UE veta pruebas de sexualidad a solicita...,spanish
9,93508,176283,2018-01-08,Polic?a de EEUU sopesa la reventa de armas con...,False,least,https://apnews.com/f3ccd457400f424984f47847b87...,publisher,Polic?a de EEUU sopesa la reventa de armas con...,spanish


Also, nearly all Spanish articles in these samples come from apnews (the 'cuando' search above also turned up one from chicagoreporter.com). In the baseline logistic regression classifier, apnews was the top-ranked word for identifying non-hyperpartisan articles, since all apnews articles are labeled non-hyperpartisan

Examine apnews articles in the dataframe

In [92]:
train[train['url'].str.contains('apnews')].shape

(75805, 8)

In [96]:
train[(train['url'].str.contains('apnews')) & (train['article_text'].str.contains('apnews'))].shape

(2990, 8)

Of the 75,805 articles with an apnews URL, 2,990 (about 3.9%) also contain 'apnews' in the article text.

In [97]:
import time
start = time.time()

some_train = train.sample(5000, random_state=27)
some_train['language'] = some_train['article_text'].apply(detect_language)
some_train['language'].value_counts()

time.time() - start  # elapsed seconds

41.16928720474243
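
At roughly 41 seconds per 5,000 articles, labeling the full training set would be slow. A likely contributor (an assumption, not profiled here) is that `stopwords.words(language)` is called and turned into a set again for every language on every article. The sketch below builds each stopword set once and reuses it. To stay self-contained it takes a plain dictionary of stopword lists and uses a regular expression assumed to mirror `wordpunct_tokenize`; `build_detector` and the tiny demo lists are hypothetical. In the notebook, the dictionary could be built from `stopwords.fileids()` and `stopwords.words(language)`.

```python
import re

# Runs of word characters, or runs of punctuation (assumed equivalent to wordpunct_tokenize)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def build_detector(stopwords_by_language):
    """Return a detect(text) function that reuses precomputed stopword sets."""
    # Build every stopword set exactly once, instead of once per article
    stopword_sets = {
        language: frozenset(word.lower() for word in words)
        for language, words in stopwords_by_language.items()
    }

    def detect(text):
        # Unique lowercase tokens in the article
        words = {token.lower() for token in TOKEN_PATTERN.findall(text)}
        # Score = number of unique stopwords of each language seen in the text
        scores = {language: len(words & sw) for language, sw in stopword_sets.items()}
        return max(scores, key=scores.get)

    return detect


# Tiny hypothetical stopword lists, for illustration only
demo_stopwords = {
    'english': ['the', 'and', 'of', 'is', 'in'],
    'spanish': ['el', 'la', 'de', 'que', 'y', 'en'],
}
detect = build_detector(demo_stopwords)
```

The returned `detect` function can be passed to `.apply()` in place of `detect_language`.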

Detecting URLs in article text

In [98]:
some_train[some_train['article_text'].str.contains('http://')]

Unnamed: 0,id,published-at,title,hyperpartisan,bias,url,labeled-by,article_text,language
287453,580359,2011-09-09,Obama Gave up on a Detroit Green Machine,True,left,http://therealnews.com/t2/index.php?option%3Dc...,publisher,Obama Gave up on a Detroit Green Machine Frank...,english
564311,1080795,2011-06-22,Dirty Water: It?s a State?s Right!,True,left,https://motherjones.com/politics/2011/06/bipar...,publisher,Dirty Water: It?s a State?s Right! Photo by ml...,english
531592,1028339,,"To clean up coal, Obama pushes more oil produc...",False,least,https://abqjournal.com/325376/to-clean-up-coal...,publisher,"To clean up coal, Obama pushes more oil produc...",english
745425,1380830,2016-03-06,Did government go overboard in prosecuting fis...,True,right,http://foxbusiness.com/markets/2014/11/05/did-...,publisher,Did government go overboard in prosecuting fis...,english
145662,284981,2010-04-16,Kent State Anniversary Blues,True,left,https://counterpunch.org/2010/04/16/kent-state...,publisher,"Kent State Anniversary Blues In my book, Magic...",english
572289,1093649,2018-01-04,Authorities: Gunman who killed deputy had seve...,False,least,https://apnews.com/9b156fd085af4c2abe312c9f235...,publisher,Authorities: Gunman who killed deputy had seve...,english
564974,1081825,2016-10-31,Bauer ice hockey gear maker files bankruptcy i...,True,right,http://foxbusiness.com/features/2016/10/31/bau...,publisher,Bauer ice hockey gear maker files bankruptcy i...,english
551307,1059893,2012-10-14,,True,right,http://govtslaves.info/max-keiser-monsanto-sho...,publisher,\n \nENJOY THIS STORY? \nGet news like this ...,english
634719,1193929,,Head of State?s Medical Marijuana Program Quits,False,least,https://abqjournal.com/68940/head-of-state%25e...,publisher,Head of State?s Medical Marijuana Program Quit...,english
757815,1406905,2011-03-16,Choosing Chicago?s next Schools CEO: Robert Ru...,False,left-center,http://chicagoreporter.com/choosing-chicagos-n...,publisher,Choosing Chicago?s next Schools CEO: Robert Ru...,english


In [99]:
some_train[some_train['article_text'].str.contains('http://')].reset_index()['article_text'][0]

"Obama Gave up on a Detroit Green Machine Frank Hammer is a retired General Motors employee and former President and Chairman of Local 909 in Warren, Michigan. He now organizes with the Auto Worker Caravan, an association of active and retired auto workers who advocate for workers demands in Washington. http://www.asotrecol.org/ \n \n \n \n PAUL JAY, SENIOR EDITOR, TRNN: Welcome to The Real News Network. I'm Paul Jay in Washington. This is part two of our interview with Frank Hammer. We're discussing President Obama's address on Labor Day in Detroit. Thanks for joining us again, Frank. \n \nFRANK HAMMER, FMR. PRESIDENT, UAW LOCAL 909: Good to be with you again. \n \nJAY: Frank is a retired autoworker. He used to be a president of UAW local in Detroit, and he's an activist working with the Autoworker Caravan. So let's go back to something some of the workers say to you, you told me off-camera beforehand, is, you know, President Obama really didn't have a choice. This--you know, at least

In [100]:
import sys
sys.path.append('../src/data')

%load_ext autoreload
%autoreload 1

import preprocess
%aimport preprocess

In [101]:
preprocess.normalize_corpus([some_train[some_train['article_text'].str.contains('http://')].reset_index()['article_text'][0]])

['obama give detroit green machine frank hammer retired general motor employee former president chairman local warren michigan organize auto worker caravan association active retired auto worker advocate worker demand washington httpwww asotrecol org paul jay senior editor trnn welcome real news network paul jay washington part two interview frank hammer discuss president obamas address labor day detroit thank join us frank frank hammer fmr president uaw local good jay frank retire autoworker use president uaw local detroit activist work autoworker caravan let us go back something worker say tell camera beforehand know president obama really not choice know least good whole industry close really could kind structuring not really political possibility else could answer question hammer well think couple thing one good thing obama probably not mention much lift cafe standard considerably support uaw way sort guarantee auto industry go forward go go much electric hybrid car essential regar

After preprocessing, http://www.asotrecol.org/ becomes httpwww asotrecol org

The domain of a linked URL may carry signal rather than noise, so we should extract the domain

In [103]:
text = some_train[some_train['article_text'].str.contains('http://')].reset_index()['article_text'][0]

In [109]:
import re

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)  # raw string so \w and \d are not treated as string escapes
urls

['http://www.asotrecol.org']

In [110]:
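import re
import tldextract

def find_extract_urls(text):
    # Find the scheme and host of each URL, then print its domain name
    urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)
    for url in urls:
        domain = tldextract.extract(url).domain  # e.g. 'asotrecol', without subdomain or suffix
        print(domain)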

In [111]:
find_extract_urls(text)

asotrecol
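
To keep that domain as a single token, each URL could be replaced by its domain before the text goes through `preprocess.normalize_corpus`. The sketch below is one way to do that under stated assumptions: it uses only the standard library, so `naive_domain` is a hypothetical stand-in for `tldextract` that takes the label before the last dot and will mishandle multi-part suffixes such as `.co.uk`. The names `replace_urls_with_domains` and `URL_PATTERN` are also illustrative.

```python
import re
from urllib.parse import urlparse

# Match a whole URL up to the next whitespace (broader than the host-only pattern above)
URL_PATTERN = re.compile(r'https?://\S+')


def naive_domain(url):
    """Return the label before the final suffix, e.g. 'asotrecol' for www.asotrecol.org.

    Deliberately naive: multi-part suffixes such as .co.uk are not handled.
    """
    try:
        host = urlparse(url).hostname or ''
    except ValueError:
        # Malformed URL (e.g. a broken IPv6 literal): treat as having no domain
        return ''
    labels = [label for label in host.split('.') if label]
    if len(labels) >= 2:
        return labels[-2]
    return labels[0] if labels else ''


def replace_urls_with_domains(text):
    """Replace every URL in `text` with its (naively extracted) domain name."""
    def _swap(match):
        raw = match.group(0)
        # Keep sentence punctuation that was glued to the end of the URL
        url = raw.rstrip('.,;:!?)"\'')
        trailing = raw[len(url):]
        return naive_domain(url) + trailing

    return URL_PATTERN.sub(_swap, text)
```

Applied to the article above, `http://www.asotrecol.org/` would become the single token `asotrecol` rather than `httpwww asotrecol org`.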


In [120]:
re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', some_train[some_train['article_text'].str.contains('http://')].reset_index()['article_text'][9])

['http://www.surveymonkey.com']

In [122]:
some_train[some_train['article_text'].str.contains('http://')].reset_index()['article_text'][9]

'Choosing Chicago?s next Schools CEO: Robert Runcie, Timothy Knowles, John White Catalyst Chicago is asking readers to submit the names of candidates they believe would be a good pick to run the Chicago Public Schools. In the coming weeks, we?ll post short profiles of the candidates. We?re inviting other readers to share their views in our ?Comments? section below. \nCatalyst Chicago is asking readers to submit the names of candidates they believe would be a good pick to run the Chicago Public Schools. In the coming weeks, we?ll post short profiles of the candidates. We?re inviting other readers to share their views in our ?Comments? section below. \nRobert Runcie, chief officer for Area 17 Experience: Robert Runcie has served as chief information officer and chief administrative officer for CPS. He is now the chief officer for Area 17, a group of elementary schools on the Southeast Side. He is a 2009 graduate of the prestigious Broad Superintendents Academy, a 10-month executive train