In [11]:
from collections import Counter
from urllib import request

## Introduction

Sources of textual data:

- Wikipedia
- Twitter
- Facebook
- News websites
- ... and many more

Why text analytics?

- Machine translation
- Contextual disambiguation
- Extract information
- Classification
- Document retrieval
- Sentiment analysis
- Topic modelling

- Classical data analysis: structured data, e.g. age, location, income
- Text analytics: unstructured data, e.g. free text

## Example 1 (lexicon approach)

We begin with a small example that showcases some of the problems associated with unstructured textual data.
In the following we read the file `ballet.txt` from GitHub and decode it as UTF-8. The file contains the text of several articles. The goal in this example is to classify
each sentence into one of four categories, depending on the gender mentioned in it: "male", "female", "mixed" or "unknown".

In [2]:
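# Download the file and decode the raw bytes as UTF-8; the with-block closes the connection.
with request.urlopen("https://raw.githubusercontent.com/boyko/text-analytics-script/main/data/ballet.txt") as url_handle:
    articles = url_handle.read().decode("utf-8")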
print(articles[:300])

With apologies to James Brown, the hardest working people in show business may well be ballet dancers. And at New York City Ballet, none work harder than the dancers in its lowest rank, the corps de ballet. During the first week of the company’s winter season, Claire Kretzschmar, 24, a rising corps 


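First we divide the articles into individual sentences by splitting the string on ".". This is a crude rule: every period ends a "sentence", including periods inside abbreviations and numbers.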

In [3]:
articles_sent = articles.split(".")

print(f"The articles consist of {len(articles_sent)} sentences.")

The articles consist of 167 sentences.
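Splitting on "." alone also cuts decimal numbers in two and ignores "!" and "?". A minimal sketch of a slightly stricter rule, assuming a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter or an opening quote (the helper `split_sentences` and the sample string are hypothetical, not part of the original notebook):

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter or opening quote follow."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"“])', text)
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]

sample = "The company has 3.5 million fans. She danced well! Did he watch?"

print(sample.split("."))        # the naive rule cuts "3.5" in two
print(split_sentences(sample))  # the regex keeps "3.5" intact
```

This heuristic still breaks on abbreviations such as "Mr. Smith", so it is only an illustration of how much the sentence count depends on the splitting rule.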


In [4]:
MALE_WORDS = set([
    'guy', 'spokesman', 'chairman', "men's", 'men', 'him', "he's", 'his',
    'boy', 'boyfriend', 'boyfriends', 'boys', 'brother', 'brothers', 'dad',
    'dads', 'dude', 'father', 'fathers', 'fiance', 'gentleman', 'gentlemen',
    'god', 'grandfather', 'grandpa', 'grandson', 'groom', 'he', 'himself',
    'husband', 'husbands', 'king', 'male', 'man', 'mr', 'nephew', 'nephews',
    'priest', 'prince', 'son', 'sons', 'uncle', 'uncles', 'waiter', 'widower',
    'widowers'
])

FEMALE_WORDS = set([
    'heroine', 'spokeswoman', 'chairwoman', "women's", 'actress', 'women',
    "she's", 'her', 'aunt', 'aunts', 'bride', 'daughter', 'daughters', 'female',
    'fiancee', 'girl', 'girlfriend', 'girlfriends', 'girls', 'goddess',
    'granddaughter', 'grandma', 'grandmother', 'herself', 'ladies', 'lady',
    'mom', 'moms', 'mother', 'mothers', 'mrs', 'ms', 'niece', 'nieces',
    'priestess', 'princess', 'queens', 'she', 'sister', 'sisters', 'waitress',
    'widow', 'widows', 'wife', 'wives', 'woman'
])

We define a simple function to classify each sentence, then apply it to each sentence in the `articles_sent` list.

In [8]:
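def classify_sentence_gender(sentence):
    """Label a sentence 'male', 'female', 'mixed' or 'unknown' using the two lexicons."""
    # Naive tokenisation: split on single spaces; case and punctuation are kept as-is.
    words = sentence.split(' ')
    # Count the distinct lexicon words that occur in the sentence.
    female_words = len(FEMALE_WORDS.intersection(words))
    male_words = len(MALE_WORDS.intersection(words))

    if male_words > 0 and female_words == 0:
        return 'male'
    if male_words == 0 and female_words > 0:
        return 'female'
    if male_words == 0 and female_words == 0:
        return 'unknown'

    # Words from both lexicons occur in the sentence.
    return 'mixed'

# Classify every sentence.
categories = list(map(classify_sentence_gender, articles_sent))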

In [9]:
print(categories)

['unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'male', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'female', 'female', 'female', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'female', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'female', 'unknown', 'unknown', 'unknown', 'female', 'unknown', 'mixed', 'unknown', 'unknown', 'unknown', 'unkn

In [10]:
counter = Counter(categories)
print(counter)

Counter({'unknown': 120, 'female': 42, 'male': 4, 'mixed': 1})
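Because `sentence.split(' ')` keeps capitalisation and attached punctuation, tokens such as "She" or "him." never match the lowercase lexicon entries. How much of the "unknown" count this explains for `ballet.txt` is not checked here. The following is a minimal sketch of the effect, assuming tiny hypothetical lexicons `MALE` and `FEMALE` (subsets of the sets above) and a hypothetical `tokenize` helper that lowercases the text and strips punctuation:

```python
import re

# Hypothetical mini-lexicons for illustration; subsets of MALE_WORDS / FEMALE_WORDS.
MALE = {"he", "him", "his", "man"}
FEMALE = {"she", "her", "woman"}

def tokenize(sentence):
    """Lowercase the sentence and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def classify(sentence, normalise=True):
    """Same decision rule as classify_sentence_gender; only the tokenisation differs."""
    words = tokenize(sentence) if normalise else sentence.split(' ')
    male = len(MALE.intersection(words))
    female = len(FEMALE.intersection(words))
    if male > 0 and female == 0:
        return 'male'
    if male == 0 and female > 0:
        return 'female'
    if male == 0 and female == 0:
        return 'unknown'
    return 'mixed'

example = "She thanked him."
print(classify(example, normalise=False))  # raw split gives ['She', 'thanked', 'him.']
print(classify(example))                   # normalised tokens are ['she', 'thanked', 'him']
```

With the raw split the example sentence is labelled "unknown"; after normalisation both lexicons match and it is labelled "mixed".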


## Example 2 (bag-of-words approach)
