# Language Detection

The tweet data provided to NBC did not include the language column (`lang`) that the Twitter API usually returns. To recover this feature I will use the LanguageDetector from the [spacy_cld](https://github.com/nickdavidhaynes/spacy-cld) extension to [spaCy](https://spacy.io/). Because detection takes a long time, I ran this code separately and saved a copy of the dataframe with the identified language for later use. I also did some basic cleaning, such as removing tweets that were duplicated or contained no text.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
import regex as re
import nltk

import spacy
from spacy_cld import LanguageDetector

Defining the text cleaning function

In [2]:
def clean_text(raw_text):
    # removing web links
    text = re.sub(r' https:[^\s]+', repl='', string=raw_text)
    text = re.sub(r' http:[^\s]+', repl='', string=text)
    text = re.sub(r'https:[^\s]+', repl='', string=text)
    text = re.sub(r'http:[^\s]+', repl='', string=text)
    # remove hashtags and @users
    text = re.sub(r'\B#\w\w+', repl='', string=text)
    text = re.sub(r'\B@\w\w+', repl='', string=text)
    # removing any html artifacts
    text = BeautifulSoup(text, 'lxml').get_text().lower()
    # replacing every run of characters other than ASCII letters, digits and @ with a space
    return re.sub('[^0-9A-Za-z@]+', repl=' ', string=text)
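
To make the effect of each step easier to see, here is a minimal sketch of the same regex steps using only the standard library. It is not the function used in this notebook: `clean_text_sketch` is a hypothetical name, the two URL patterns are merged into one, and the BeautifulSoup step is left out.

```python
import re

def clean_text_sketch(raw_text):
    # Drop http/https links, together with one leading space if present.
    text = re.sub(r' ?https?:\S+', '', raw_text)
    # Drop hashtags and @mentions that have at least two word characters.
    text = re.sub(r'\B#\w\w+', '', text)
    text = re.sub(r'\B@\w\w+', '', text)
    # Lowercase, then collapse every run of characters outside
    # 0-9, A-Z, a-z and @ into a single space.
    return re.sub(r'[^0-9A-Za-z@]+', ' ', text.lower())

sample = "Check this out https://t.co/abc #MAGA @user_1 great!"
cleaned = clean_text_sketch(sample)
print(repr(cleaned))   # 'check this out great '

cyrillic = clean_text_sketch("Это тест")
print(repr(cyrillic))  # ' '
```

The second call shows a side effect of the final pattern: because it keeps only ASCII letters and digits, text written in a non-Latin script is reduced to whitespace before it reaches the language detector.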

**Testing Out the Language Detector**

In [3]:
# directly from the spacy_cld site
nlp = spacy.load('en')
language_detector = LanguageDetector()

nlp.add_pipe(language_detector)
doc = nlp('This is some English text.')

In [4]:
doc._.languages

['en']

In [5]:
doc._.language_scores['en']

0.96

Defining the function we will pass into the dataframe.

In [6]:
def detect_language(text=None):
    if text:                                # skip None and empty strings
        doc = nlp(text)                     # create a spaCy document
        if doc._.languages:                 # the detector found at least one language
            return doc._.languages[0]       # return the first detected language
        else:
            return 'unknown'                # no language could be detected
    else:
        return 'no_text'                    # no text was supplied
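
One detail worth noting: the truthiness check `if text:` catches `None` and empty strings, but a pandas NaN is a float and evaluates as true, so it would be passed on to the pipeline. In this notebook the rows without text are dropped before detection, so it does not matter here. As a sketch of a stricter guard, assuming a hypothetical `detector` callable that returns a list of language codes (a stand-in for the spaCy pipeline, not part of spacy_cld):

```python
def detect_language_safe(text, detector):
    # Treat anything that is not a non-blank string (None, NaN, '') as missing.
    if not isinstance(text, str) or not text.strip():
        return 'no_text'
    languages = detector(text)
    # Return the first detected code, or 'unknown' if the list is empty.
    return languages[0] if languages else 'unknown'

def stub_detector(text):
    # Toy stand-in: reports English only when the word 'english' appears.
    return ['en'] if 'english' in text.lower() else []

inputs = ["This is some English text.", "??", "", None, float('nan')]
results = [detect_language_safe(value, stub_detector) for value in inputs]
print(results)  # ['en', 'unknown', 'no_text', 'no_text', 'no_text']
print(bool(float('nan')))  # True, which is why a plain truthiness check lets NaN through
```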

Testing

In [7]:
# russian
detect_language("Это тест")

'unknown'

In [8]:
# english
detect_language("This is some English text.")

'en'

In [9]:
# spanish
detect_language('esto es una prueba')

'es'

In [10]:
# null
detect_language()

'no_text'

**Modifying the Dataset**

Bringing in the dataset and doing basic cleaning

In [11]:
tweets = pd.read_csv("./datasets/NBC/tweets.csv")

In [12]:
# counting duplicated tweets
tweets.duplicated(['tweet_id', 'user_key', 'created_at']).sum()

304

In [13]:
tweets.text.isnull().sum()

21

In [14]:
# dropping duplicates and tweets with no text
tweets.drop_duplicates(['tweet_id', 'user_key', 'created_at'], inplace=True)
tweets.dropna(axis = 0, subset=['text'], inplace=True)
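
The same count-then-drop pattern can be checked on a toy frame. This is only an illustration with made-up rows, not data from the NBC file; it returns a new dataframe instead of using `inplace=True`, which is a stylistic choice rather than a required change.

```python
import pandas as pd

# Made-up rows: one exact duplicate on the key columns and one missing text.
toy = pd.DataFrame({
    'tweet_id': [1, 1, 2, 3],
    'user_key': ['a', 'a', 'b', 'c'],
    'created_at': [10, 10, 20, 30],
    'text': ['hello', 'hello', None, 'hola'],
})
keys = ['tweet_id', 'user_key', 'created_at']

n_dupes = int(toy.duplicated(keys).sum())      # rows repeating an earlier key combination
n_missing = int(toy['text'].isnull().sum())    # rows with no text

# Keep the first row of each key combination, then drop rows without text.
toy_clean = toy.drop_duplicates(keys).dropna(subset=['text'])
print(n_dupes, n_missing, toy_clean.shape)     # 1 1 (2, 4)
```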

In [15]:
tweets.shape

(203126, 16)

Cleaning the text

In [16]:
%%time
tweets['cleaned_text'] = tweets.text.map(clean_text)

CPU times: user 1min 23s, sys: 4.79 s, total: 1min 28s
Wall time: 1min 28s


Adding the lang feature

In [17]:
%%time
tweets['lang'] = tweets.cleaned_text.map(detect_language)

CPU times: user 2h 21min 29s, sys: 24min 13s, total: 2h 45min 42s
Wall time: 43min 36s
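
Since this step dominates the runtime, one possible way to shorten it is to cache results so that identical cleaned texts are only sent through the detector once. Whether this helps depends on how many cleaned tweets repeat, which was not measured here. The sketch below uses a hypothetical `slow_detector` as a stand-in for the spaCy call and records how often it is actually invoked.

```python
from functools import lru_cache

calls = []

def slow_detector(text):
    # Hypothetical stand-in for the expensive spaCy pipeline call.
    calls.append(text)
    return 'es' if text == 'hola' else 'en'

@lru_cache(maxsize=None)
def cached_detect(text):
    # Repeated texts are answered from the cache instead of re-running the detector.
    return slow_detector(text)

texts = ['rt great news', 'rt great news', 'hola', 'rt great news']
langs = [cached_detect(t) for t in texts]
print(langs)        # ['en', 'en', 'es', 'en']
print(len(calls))   # 2
```

A function wrapped this way can be passed to `Series.map` in the same way as the uncached one, since strings are hashable.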


In [18]:
#tweets.to_csv('./datasets/tweets_with_lang.csv')