## Data collection

1. I selected the top 50 popular actors and actresses according to IMDb (https://www.imdb.com/list/ls053501318/). BeautifulSoup was used to scrape the webpage and retrieve their names.
2. Each person's Wikipedia page was downloaded using the `wikipedia` library in Python, and its content was saved to a separate file.
3. To get a baseline of prevalent words related to social movements, seven Wikipedia pages were downloaded: 'LGBT social movements', 'Environmental movement', 'Human rights movement', 'Anti-war movement', 'Animal rights movement', 'Black Lives Matter', and 'Civil rights movement'.

In [1]:
import wikipedia
from bs4 import BeautifulSoup
import requests


In [2]:
url = "https://www.imdb.com/list/ls053501318/"

In [3]:
r = requests.get(url)

In [4]:
data = r.text
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning and parser-dependent results

In [5]:
names = []

In [6]:
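# Getting names of actors: each name is the text of the link inside an <h3> heading
for element in soup.find_all('h3'):
    try:
        name = element.find('a').contents[0].strip()
        if len(name) > 0:
            names.append(name)
    except (AttributeError, IndexError, TypeError):
        # Heading without a link, or a link without plain text: not an actor entry
        continue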

In [7]:
print(names[:10])

['Johnny Depp', 'Al Pacino', 'Robert De Niro', 'Kevin Spacey', 'Denzel Washington', 'Russell Crowe', 'Brad Pitt', 'Angelina Jolie', 'Leonardo DiCaprio', 'Tom Cruise']


In [9]:
# Downloading each person's wiki page using the wikipedia library
for name in names:
    wikipage = wikipedia.page(name)
    if len(wikipage.content) > 0:
        fn = name.lower().replace(" ", "_")
        # Saving each person's wiki page into a separate file (UTF-8 so non-ASCII characters survive)
        with open(f'{fn}.txt', 'w', encoding='utf-8') as f:
            f.write(wikipage.content)
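
The loop above assumes that every name resolves to exactly one page. Lookups can fail, for example when a name is ambiguous or has no matching page, so a more defensive variant might record the failures and carry on. The sketch below is only an illustration: `save_pages` and `fetch` are hypothetical names, and the fetch function is passed in as an argument so that the logic can be tried without network access. In this notebook it could be something like `lambda n: wikipedia.page(n).content`.

```python
from pathlib import Path


def save_pages(names, fetch, out_dir="."):
    """Fetch text for each name, save it as <name>.txt, and return the failures."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for name in names:
        try:
            content = fetch(name)
        except Exception as err:  # e.g. an ambiguous or missing page title
            failed.append((name, repr(err)))
            continue
        if content:
            fn = name.lower().replace(" ", "_")
            # Write as UTF-8 so accented characters in the page survive
            (out / f"{fn}.txt").write_text(content, encoding="utf-8")
    return failed
```

Returning the failed names makes it easy to check afterwards which people are missing from the corpus before reading the files back in.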

In [10]:
activist_pages = ['LGBT social movements','Environmental movement','Human rights movement',
                 'Anti-war movement','Animal rights movement','Black Lives Matter',
                 'Civil rights movement']

In [11]:
activist_txt = []

In [12]:
for p in activist_pages:
    activist_txt.append(wikipedia.page(p).content) 

In [13]:
len(activist_txt)

7

## Text cleaning and preprocessing

In [14]:
import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
#nltk.download('wordnet') 
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

In [15]:
stop_words = set(stopwords.words("english"))

In [16]:
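# Extra corpus-specific stop words (note the comma after 'late': without it,
# Python joins the two strings into the single word 'latesometimes')
stop_words = stop_words.union(set(['people', 'child', 'parent', 'study', 'also', 'men', 'group', 'late',
                                   'sometimes']))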

In [17]:
def clean_text(text):
    # Remove HTML-style tags first, while the angle brackets are still present
    text = re.sub(r"</?.*?>", " ", text)

    # Keep letters only (drops digits, punctuation and other special characters)
    text = re.sub(r"[^a-zA-Z]", " ", text)

    # Convert to lowercase
    text = text.lower()

    # Convert the string to a list of words
    words = text.split()

    # Lemmatisation, dropping stop words along the way
    lem = WordNetLemmatizer()
    words = [lem.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(words)

In [18]:
activist_txt = [clean_text(i) for i in activist_txt]
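
To make the effect of cleaning concrete, here is a simplified stand-in for `clean_text`. It is a sketch under stated assumptions, not the function used in this notebook: it skips lemmatisation and uses a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) so that it runs without downloading any NLTK data.

```python
import re

# Hypothetical mini stop-word list, for illustration only
DEMO_STOP_WORDS = {"in", "and", "the"}


def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    # Keep letters only, then lowercase
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Split into words and drop stop words (no lemmatisation in this sketch)
    return " ".join(w for w in text.split() if w not in stop_words)


print(clean_text_demo("In 2016, Schwarzenegger endorsed John Kasich and the party."))
# schwarzenegger endorsed john kasich party
```

Digits and punctuation disappear, everything is lowercased, and only content words remain. The real function additionally maps each word to its lemma.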

In [23]:
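paragraphs = []
raw_paragraphs = []
for name in names:
    fn = name.lower().replace(" ", "_")
    with open(f'{fn}.txt', 'r', encoding='utf-8') as f:
        # Break each wiki page down into smaller articles (sections are separated by two blank lines)
        sections = f.read().split("\n\n\n")
    # Keep only sections with more than 10 space-separated words
    raw_content = [s for s in sections if len(s.split(" ")) > 10]
    raw_paragraphs.extend(raw_content)
    # Clean the same sections so that paragraphs[i] corresponds to raw_paragraphs[i]
    paragraphs.extend(clean_text(s) for s in raw_content)
# Alternative: treat each whole page as a single document
#         content = f.read()
#         paragraphs.append(clean_text(content))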

In [25]:
len(paragraphs)

698
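
The splitting step can be tried on a toy page. The sketch below mirrors the logic used above (split on `"\n\n\n"`, keep sections with more than 10 space-separated words); `split_sections` and the sample text are made up for illustration.

```python
def split_sections(page_text, min_words=10):
    """Split a page on double blank lines and drop very short sections."""
    sections = page_text.split("\n\n\n")
    return [s for s in sections if len(s.split(" ")) > min_words]


# Made-up page: an intro, a bare heading, and a subsection with body text
page = (
    "Intro sentence with quite a few words so that it passes the length filter easily."
    "\n\n\n== Career =="
    "\n\n\n=== Politics ===\nHe endorsed a candidate in the primaries and later spoke at several rallies."
)

kept = split_sections(page)
print(len(kept))  # 2
```

The bare `== Career ==` heading is dropped because it has too few words, while the subsection keeps its `=== Politics ===` title attached to the body text. This is why the retrieved examples later in the notebook start with a section title.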

## Getting seed words for activists

In [64]:
# Most frequently occurring words
def get_top_n_words(corpus, n=None):
    # Recent scikit-learn versions reject a set for stop_words, so pass a list
    vec = CountVectorizer(max_df=0.8, stop_words=list(stop_words), max_features=200,
                          ngram_range=(1, 3), min_df=0.3).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each term across all documents
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Convert the most frequent words to a DataFrame for plotting a bar plot
top_words = get_top_n_words(activist_txt, n=20)
top_df = pd.DataFrame(top_words, columns=["Word", "Freq"])

In [65]:
activist_top_words = [i[0] for i in top_words]

In [66]:
activist_top_words

['black',
 'white',
 'matter',
 'civil right',
 'police',
 'human right',
 'african american',
 'law',
 'king',
 'march',
 'community',
 'freedom',
 'city',
 'local',
 'officer',
 'president',
 'racial',
 'day',
 'civil right movement',
 'liberation']
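
The `min_df=0.3` and `max_df=0.8` settings are what keep this list from being dominated by words that are either specific to one page or shared by all of them. When given as floats, both are proportions of documents, so with seven pages a term has to appear in at least 3 pages (0.3 × 7 = 2.1) and at most 5 pages (0.8 × 7 = 5.6). The sketch below reproduces only that filtering idea with the standard library: it handles single words only, and `top_terms` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def top_terms(docs, n=20, min_df=0.3, max_df=0.8):
    """Rank terms by total count, keeping those within the document-frequency bounds."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in tokenized:
        doc_freq.update(set(tokens))
        term_freq.update(tokens)
    low, high = min_df * n_docs, max_df * n_docs
    kept = {t: c for t, c in term_freq.items() if low <= doc_freq[t] <= high}
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Seven toy "pages": 'movement' is in all of them, 'police' in three, 'rare' in one
toy_docs = [
    "movement police police", "movement police", "movement police rare",
    "movement", "movement", "movement", "movement",
]
print(top_terms(toy_docs))  # [('police', 4)]
```

In the toy corpus, `movement` is removed for being too common and `rare` for being too specific, which is the same reason a word such as "movement" does not show up in the seed list above.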

## Guided LDA: using seed words to define topic, and retrieve similar documents in the topic

In [67]:
import guidedlda


In [68]:
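# Recent scikit-learn versions reject a set for stop_words, so pass a list
vec = CountVectorizer(max_df=0.8, stop_words=list(stop_words),
                      ngram_range=(1, 3), min_df=0.01)
X = vec.fit_transform(paragraphs).toarray()  # dense document-term count matrix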


In [69]:
word2id = vec.vocabulary_

In [70]:
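# Order the terms by their column index so that vocab[i] is the term for column i of X
vocab = tuple(sorted(word2id, key=word2id.get))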

In [71]:
seed_topic_list = [activist_top_words]

In [73]:
model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)

In [74]:
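# Map each seed word's column index to the topic it should guide;
# seed words that are missing from the vocabulary are skipped
seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        if word in word2id:
            seed_topics[word2id[word]] = t_id
model.fit(X, seed_topics=seed_topics, seed_confidence=1)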

INFO:guidedlda:n_documents: 698
INFO:guidedlda:vocab_size: 3464
INFO:guidedlda:n_words: 109849
INFO:guidedlda:n_topics: 5
INFO:guidedlda:n_iter: 100
INFO:guidedlda:<0> log likelihood: -1072245
INFO:guidedlda:<20> log likelihood: -869446
INFO:guidedlda:<40> log likelihood: -856846
INFO:guidedlda:<60> log likelihood: -851806
INFO:guidedlda:<80> log likelihood: -849698
INFO:guidedlda:<99> log likelihood: -847731


<guidedlda.guidedlda.GuidedLDA at 0x1a1ffc5780>
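
The seeding step boils down to a dictionary from vocabulary index to topic id. Because `seed_topic_list` holds a single list, only topic 0 is guided here, and the other four topics are left unseeded. The toy example below shows how that dictionary is built and which seed words are silently dropped. `build_seed_topics` and the toy vocabulary are hypothetical, for illustration only.

```python
def build_seed_topics(seed_topic_list, word2id):
    """Map vocabulary indices of seed words to topic ids; also report unknown words."""
    seed_topics = {}
    skipped = []
    for topic_id, seed_words in enumerate(seed_topic_list):
        for word in seed_words:
            if word in word2id:
                seed_topics[word2id[word]] = topic_id
            else:
                skipped.append(word)  # not in the corpus vocabulary
    return seed_topics, skipped


# Toy vocabulary: 'liberation' does not occur in the corpus
toy_word2id = {"police": 0, "civil right": 1, "film": 2}
seeds, skipped = build_seed_topics([["police", "civil right", "liberation"]], toy_word2id)
print(seeds)    # {0: 0, 1: 0}
print(skipped)  # ['liberation']
```

Checking the skipped words is a quick way to see how many of the 20 seed terms actually occur in the actors' pages.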

In [75]:
model.doc_topic_[:,0].argsort()[::-1][:10]

array([199, 387, 528, 198, 258, 266, 264, 642, 262, 499])
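
The one-liner above sorts documents by their weight on topic 0 and returns the indices of the ten highest. A plain-Python equivalent on a toy matrix may make this easier to follow. `top_docs_for_topic` is a hypothetical helper, and with tied weights its ordering can differ from the NumPy version.

```python
def top_docs_for_topic(doc_topic, topic_id=0, k=10):
    """Return indices of the k documents with the highest weight on topic_id."""
    weights = [row[topic_id] for row in doc_topic]
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


# Toy document-topic matrix: 4 documents, 2 topics, each row sums to 1
toy_doc_topic = [
    [0.10, 0.90],
    [0.70, 0.30],
    [0.40, 0.60],
    [0.95, 0.05],
]
print(top_docs_for_topic(toy_doc_topic, topic_id=0, k=2))  # [3, 1]
```

The returned indices can be used to look up the original text in `raw_paragraphs`, as long as that list is in the same order as the documents the model was fitted on.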

## Examples of paragraphs (articles) that are related to social/political progressive issues

In [76]:
[raw_paragraphs[i] for i in model.doc_topic_[:,0].argsort()[::-1][:10]]

["=== 2016 Presidential election ===\nFor the 2016 Republican Party presidential primaries, Schwarzenegger endorsed fellow Republican John Kasich. However, he announced in October that he would not vote for the Republican presidential candidate Donald Trump in that year's United States presidential election, with this being the first time he did not vote for the Republican candidate since becoming a citizen in 1983.",
 '=== Political opinions ===\nConnery is a member of the Scottish National Party (SNP), a centre-left political party campaigning for Scottish independence from the United Kingdom, and has supported the party financially and through personal appearances. His funding of the SNP ceased in 2001, when the UK Parliament passed legislation that prohibited overseas funding of political activities in the UK.',
 '=== Religious views ===\nDuring a 1992 Vanity Fair interview, Nicholson stated, "I don\'t believe in God now. I can still work up an envy for someone who has faith. I can