# Analyzing August 2016 Kenya English Twitter Data

Historical data that was crawled over the month of August 2016 by IST Pulse was utilized in this notebook to generate a topic model and draw themes in the language. We narrowed our focus on English data because it composed the majority. The `kenya_health_data_query.py` program was utilized as a wrapper to quickly query the data. The `es_data_processor.py` program was used to extract the fields from the JSON formatted data that are most necessary for linguistic, geospatial, and time series analyses. The `tweet_processor.py` program was utilized to preprocess the text data in preparation for the topic modeling task. The latest version separates hashtags into terms (best guess).

The Python package `gensim` was used to perform the Latent Dirichlet Allocation algorithm. Unlike in the April analysis, a single core LDA model was used, in order to allow for guaranteed reproducibility. This is much slower, so it is only worthwhile if reproducibility is necessary.

This analysis was re-processed to provide the ability to save/load models and data associated with each part of the process.

## Query Data from Elasticsearch (es)

In [4]:
from tf_data_query import DocumentGather

In [5]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [6]:
dg = DocumentGather()

In [52]:
#print number of tweets in the English language, tagged in Kenya, from the month of May
#print(ktg.get_n_items(begin='2016-08-01', end='2016-09-01', lang=None))
print(ktg.get_n_items(begin='2016-08-01', end='2016-09-01', lang='en'))

2018-02-20 08:49:09,819 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/ic-ke-health-darpa/_count [status:200 request:0.294s]


141895


In [53]:
#Estimated time of processing ~ 5 mins  

In [54]:
data = ktg.get_data(begin='2016-08-01', end='2016-09-01',lang='en') 


2018-02-20 08:49:12,345 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/ic-ke-health-darpa/_count [status:200 request:0.154s]
2018-02-20 08:49:13,644 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/ic-ke-health-darpa/_search?scroll=5m&size=1000 [status:200 request:1.254s]
2018-02-20 08:49:15,650 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.470s]
2018-02-20 08:49:18,476 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.459s]
2018-02-20 08:49:21,115 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.264s]
2018-02-20 08:49:24,670 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:3.166s]
2018-02-20 08:49:28,986

fraction complete: 0.07047464674583319


2018-02-20 08:49:44,128 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:3.591s]
2018-02-20 08:49:46,834 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.447s]
2018-02-20 08:49:50,383 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.834s]
2018-02-20 08:49:52,704 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.050s]
2018-02-20 08:49:58,551 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:5.574s]
2018-02-20 08:50:00,736 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.500s]
2018-02-20 08:50:02,783 : INFO : GET https://c

fraction complete: 0.14094929349166638


2018-02-20 08:50:10,570 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.914s]
2018-02-20 08:50:12,491 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.563s]
2018-02-20 08:50:14,166 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.435s]
2018-02-20 08:50:16,608 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.223s]
2018-02-20 08:50:18,857 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.225s]
2018-02-20 08:50:21,928 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.811s]
2018-02-20 08:50:23,912 : INFO : GET https://c

fraction complete: 0.21142394023749955


2018-02-20 08:50:34,641 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.499s]
2018-02-20 08:50:36,907 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.002s]
2018-02-20 08:50:38,590 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.471s]
2018-02-20 08:50:40,155 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.328s]
2018-02-20 08:50:42,841 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.460s]
2018-02-20 08:50:45,765 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.674s]
2018-02-20 08:50:48,134 : INFO : GET https://c

fraction complete: 0.28189858698333276


2018-02-20 08:50:56,109 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.126s]
2018-02-20 08:50:57,650 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.328s]
2018-02-20 08:50:59,238 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.335s]
2018-02-20 08:51:01,172 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.644s]
2018-02-20 08:51:03,042 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.591s]
2018-02-20 08:51:05,213 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.927s]
2018-02-20 08:51:08,226 : INFO : GET https://c

fraction complete: 0.35237323372916596


2018-02-20 08:51:22,279 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:3.263s]
2018-02-20 08:51:24,704 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.188s]
2018-02-20 08:51:27,770 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.801s]
2018-02-20 08:51:30,365 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.328s]
2018-02-20 08:51:32,467 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.854s]
2018-02-20 08:51:35,101 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.379s]
2018-02-20 08:51:37,797 : INFO : GET https://c

fraction complete: 0.4228478804749991


2018-02-20 08:51:45,114 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.231s]
2018-02-20 08:51:48,395 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.083s]
2018-02-20 08:51:49,897 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.215s]
2018-02-20 08:51:51,670 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.502s]
2018-02-20 08:51:53,303 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.370s]
2018-02-20 08:51:55,056 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.503s]
2018-02-20 08:51:57,503 : INFO : GET https://c

fraction complete: 0.4933225272208323


2018-02-20 08:52:05,027 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.426s]
2018-02-20 08:52:06,550 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.235s]
2018-02-20 08:52:08,076 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.259s]
2018-02-20 08:52:10,176 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.871s]
2018-02-20 08:52:12,184 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.757s]
2018-02-20 08:52:14,982 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.530s]
2018-02-20 08:52:19,418 : INFO : GET https://c

fraction complete: 0.5637971739666655


2018-02-20 08:52:26,561 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.256s]
2018-02-20 08:52:28,137 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.252s]
2018-02-20 08:52:29,413 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.013s]
2018-02-20 08:52:30,657 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:0.980s]
2018-02-20 08:52:32,389 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.478s]
2018-02-20 08:52:34,212 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.598s]
2018-02-20 08:52:35,669 : INFO : GET https://c

fraction complete: 0.6342718207124987


2018-02-20 08:52:44,913 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.939s]
2018-02-20 08:52:46,445 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.249s]
2018-02-20 08:52:47,985 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.310s]
2018-02-20 08:52:49,948 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.723s]
2018-02-20 08:52:51,185 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:0.993s]
2018-02-20 08:52:59,713 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.607s]
2018-02-20 08:53:01,725 : INFO : GET https://c

fraction complete: 0.7047464674583319


2018-02-20 08:53:09,840 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.203s]
2018-02-20 08:53:11,576 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.498s]
2018-02-20 08:53:14,908 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:3.108s]
2018-02-20 08:53:17,133 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.006s]
2018-02-20 08:53:19,888 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.539s]
2018-02-20 08:53:21,592 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.445s]
2018-02-20 08:53:23,044 : INFO : GET https://c

fraction complete: 0.7752211142041651


2018-02-20 08:53:31,438 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.785s]
2018-02-20 08:53:33,509 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.824s]
2018-02-20 08:53:35,828 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.073s]
2018-02-20 08:53:37,635 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.588s]
2018-02-20 08:53:38,975 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.097s]
2018-02-20 08:53:40,750 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.543s]
2018-02-20 08:53:43,376 : INFO : GET https://c

fraction complete: 0.8456957609499982


2018-02-20 08:53:55,869 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.264s]
2018-02-20 08:53:57,730 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.622s]
2018-02-20 08:53:59,083 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.083s]
2018-02-20 08:54:00,674 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.329s]
2018-02-20 08:54:01,916 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.002s]
2018-02-20 08:54:05,054 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.900s]
2018-02-20 08:54:06,313 : INFO : GET https://c

fraction complete: 0.9161704076958315


2018-02-20 08:54:13,536 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.467s]
2018-02-20 08:54:15,980 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.192s]
2018-02-20 08:54:17,789 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.552s]
2018-02-20 08:54:19,469 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.428s]
2018-02-20 08:54:21,548 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.826s]
2018-02-20 08:54:24,264 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.457s]
2018-02-20 08:54:26,856 : INFO : GET https://c

fraction complete: 0.9866450544416646


2018-02-20 08:54:35,053 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:1.336s]
2018-02-20 08:54:37,692 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:2.432s]
2018-02-20 08:54:37,905 : INFO : GET https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll?scroll=5m [status:200 request:0.027s]
2018-02-20 08:54:37,927 : INFO : DELETE https://c3b40d472cc1e8bc18ea4143fd81e66b.us-east-1.aws.found.io:9243/_search/scroll [status:200 request:0.021s]


## Extract Necessary Fields

In [55]:
from es_data_processor import ESDataProcessor

In [56]:
esdp = ESDataProcessor(data)

In [57]:
kenya_geo_df = esdp.format_df()

In [51]:
kenya_geo_df.head()

Unnamed: 0,date,lat,lon,text,tweet_id
0,2016-08-24T04:23:06+00:00,-0.067814,37.905184,@KeshyRouzie Please use *140# and follow prom...,47dd54dd1c3eb6cd50c15ae9723b74747dbfaa5b80fdb8...
1,2016-08-28T12:55:28+00:00,-1.274863,36.86362,@IntelligentPix pics &amp; music tell it all #...,2f7974554565a3b2e31558f2df2ab73cdd9d045ef9e654...
2,2016-08-25T20:25:51+00:00,-1.274863,36.86362,Sadly PSG is taking all the 6points! https://t...,47f68e2640a575a5a43b8c92deba89071bc0a2b5f492b5...
3,2016-08-31T16:23:01+00:00,-0.368762,35.936808,@s_nakhone @alawiabdul stabbed his bf i hear that,2f530d9439c28cfa4ce139e2056f4d26abb22bbef35770...
4,2016-08-20T01:48:48+00:00,-4.021603,39.699591,@De6rasse of Canada is the next big thing afte...,2ddc236289da2fad099babe240bc6fdda1a5afc1907e95...


## Clean Text Data

In [58]:
from tweet_processor import TweetProcessor

In [59]:
tp = TweetProcessor()

In [60]:
texts = list(kenya_geo_df.text)
cleaned_texts = []
for t in texts:
    cleaned_text = tp.clean_text(t)
    cleaned_texts.append(cleaned_text)

In [61]:
cleaned_texts[0]

['please', 'use', 'follow', 'prompts', 'caro']

In [62]:
sparse = tp.make_sparse(texts=cleaned_texts)

In [63]:
vecs = [tp.stem_text(word_list=text) for text in sparse]

In [64]:
strings = [tp.re_string(text_list=text).strip() for text in cleaned_texts]

In [65]:
strings[0]

'please use follow prompts caro'

In [66]:
#append the preprocessed text as a column to the dataframe to keep track of original tweets
kenya_geo_df['final_string'] = strings

In [67]:
kenya_geo_df.head(n=5)

Unnamed: 0,date,lat,lon,text,tweet_id,final_string
0,2016-08-24T04:23:06+00:00,-0.067814,37.905184,@KeshyRouzie Please use *140# and follow prom...,47dd54dd1c3eb6cd50c15ae9723b74747dbfaa5b80fdb8...,please use follow prompts caro
1,2016-08-28T12:55:28+00:00,-1.274863,36.86362,@IntelligentPix pics &amp; music tell it all #...,2f7974554565a3b2e31558f2df2ab73cdd9d045ef9e654...,pics music tell cheki feat sharama makadem
2,2016-08-25T20:25:51+00:00,-1.274863,36.86362,Sadly PSG is taking all the 6points! https://t...,47f68e2640a575a5a43b8c92deba89071bc0a2b5f492b5...,sadly psg taking points
3,2016-08-31T16:23:01+00:00,-0.368762,35.936808,@s_nakhone @alawiabdul stabbed his bf i hear that,2f530d9439c28cfa4ce139e2056f4d26abb22bbef35770...,stabbed bf hear
4,2016-08-20T01:48:48+00:00,-4.021603,39.699591,@De6rasse of Canada is the next big thing afte...,2ddc236289da2fad099babe240bc6fdda1a5afc1907e95...,canada next big thing rio


## Topic Modeling Analysis

In [68]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [69]:
kenya_geo_df.to_csv('~/repos/validate/data/model_persist/month01/082016_espull.csv')

In [70]:
from gensim import corpora

dictionary = corpora.Dictionary(vecs)

2018-02-20 09:11:02,379 : INFO : 'pattern' package not found; tag filters are not available for English
2018-02-20 09:11:02,384 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-02-20 09:11:02,519 : INFO : adding document #10000 to Dictionary(9608 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:02,650 : INFO : adding document #20000 to Dictionary(13171 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:02,782 : INFO : adding document #30000 to Dictionary(15501 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:02,911 : INFO : adding document #40000 to Dictionary(17125 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:03,044 : INFO : adding document #50000 to Dictionary(18512 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:03,177 : INFO : adding document #60000 to Dictionary(19702 unique tokens: ['caro', 'follow', 

In [21]:
corpus = [dictionary.doc2bow(text) for text in vecs]

In [41]:
print(len(corpus))
print(len(dictionary))

141895
23453


In [71]:
from gensim import corpora

dictionary = corpora.Dictionary(vecs)
dictionary.save('~/repos/validate/data/model_persist/month01/082016.dict')

2018-02-20 09:11:11,409 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-02-20 09:11:11,536 : INFO : adding document #10000 to Dictionary(9608 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:11,670 : INFO : adding document #20000 to Dictionary(13171 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:11,804 : INFO : adding document #30000 to Dictionary(15501 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:11,928 : INFO : adding document #40000 to Dictionary(17125 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:12,052 : INFO : adding document #50000 to Dictionary(18512 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:12,186 : INFO : adding document #60000 to Dictionary(19702 unique tokens: ['caro', 'follow', 'pleas', 'prompt', 'use']...)
2018-02-20 09:11:12,317 : INFO : adding document #70000 to Dictionary(2069

In [72]:
corpus = [dictionary.doc2bow(item) for item in vecs]
corpora.MmCorpus.serialize('~/repos/validate/data/model_persist/month01/082016.mm', corpus)

2018-02-20 09:11:19,885 : INFO : storing corpus in Matrix Market format to ~/repos/validate/data/model_persist/month01/082016.mm
2018-02-20 09:11:19,887 : INFO : saving sparse matrix to ~/repos/validate/data/model_persist/month01/082016.mm
2018-02-20 09:11:19,887 : INFO : PROGRESS: saving document #0
2018-02-20 09:11:19,907 : INFO : PROGRESS: saving document #1000
2018-02-20 09:11:19,926 : INFO : PROGRESS: saving document #2000
2018-02-20 09:11:19,948 : INFO : PROGRESS: saving document #3000
2018-02-20 09:11:19,971 : INFO : PROGRESS: saving document #4000
2018-02-20 09:11:19,994 : INFO : PROGRESS: saving document #5000
2018-02-20 09:11:20,018 : INFO : PROGRESS: saving document #6000
2018-02-20 09:11:20,041 : INFO : PROGRESS: saving document #7000
2018-02-20 09:11:20,064 : INFO : PROGRESS: saving document #8000
2018-02-20 09:11:20,113 : INFO : PROGRESS: saving document #9000
2018-02-20 09:11:20,134 : INFO : PROGRESS: saving document #10000
2018-02-20 09:11:20,171 : INFO : PROGRESS: savi

2018-02-20 09:11:22,373 : INFO : PROGRESS: saving document #121000
2018-02-20 09:11:22,393 : INFO : PROGRESS: saving document #122000
2018-02-20 09:11:22,415 : INFO : PROGRESS: saving document #123000
2018-02-20 09:11:22,434 : INFO : PROGRESS: saving document #124000
2018-02-20 09:11:22,457 : INFO : PROGRESS: saving document #125000
2018-02-20 09:11:22,478 : INFO : PROGRESS: saving document #126000
2018-02-20 09:11:22,499 : INFO : PROGRESS: saving document #127000
2018-02-20 09:11:22,518 : INFO : PROGRESS: saving document #128000
2018-02-20 09:11:22,538 : INFO : PROGRESS: saving document #129000
2018-02-20 09:11:22,563 : INFO : PROGRESS: saving document #130000
2018-02-20 09:11:22,583 : INFO : PROGRESS: saving document #131000
2018-02-20 09:11:22,604 : INFO : PROGRESS: saving document #132000
2018-02-20 09:11:22,622 : INFO : PROGRESS: saving document #133000
2018-02-20 09:11:22,642 : INFO : PROGRESS: saving document #134000
2018-02-20 09:11:22,662 : INFO : PROGRESS: saving document #13

In [43]:
#Save + pickle
dictionary.save('~/repos/validate/data/model_persist/month01/082016.dict')
corpora.MmCorpus.serialize('~/repos/validate/data/model_persist/month01/082016.mm', corpus)

2018-02-17 10:30:27,431 : INFO : saving Dictionary object under ~/repos/validate/data/model_persist/month01/082016.dict, separately None
2018-02-17 10:30:27,465 : INFO : saved ~/repos/validate/data/model_persist/month01/082016.dict
2018-02-17 10:30:27,468 : INFO : storing corpus in Matrix Market format to ~/repos/validate/data/model_persist/month01/082016.mm
2018-02-17 10:30:27,468 : INFO : saving sparse matrix to ~/repos/validate/data/model_persist/month01/082016.mm
2018-02-17 10:30:27,469 : INFO : PROGRESS: saving document #0
2018-02-17 10:30:27,488 : INFO : PROGRESS: saving document #1000
2018-02-17 10:30:27,513 : INFO : PROGRESS: saving document #2000
2018-02-17 10:30:27,542 : INFO : PROGRESS: saving document #3000
2018-02-17 10:30:27,573 : INFO : PROGRESS: saving document #4000
2018-02-17 10:30:27,601 : INFO : PROGRESS: saving document #5000
2018-02-17 10:30:27,625 : INFO : PROGRESS: saving document #6000
2018-02-17 10:30:27,647 : INFO : PROGRESS: saving document #7000
2018-02-17 

2018-02-17 10:30:29,817 : INFO : PROGRESS: saving document #117000
2018-02-17 10:30:29,836 : INFO : PROGRESS: saving document #118000
2018-02-17 10:30:29,855 : INFO : PROGRESS: saving document #119000
2018-02-17 10:30:29,874 : INFO : PROGRESS: saving document #120000
2018-02-17 10:30:29,893 : INFO : PROGRESS: saving document #121000
2018-02-17 10:30:29,911 : INFO : PROGRESS: saving document #122000
2018-02-17 10:30:29,932 : INFO : PROGRESS: saving document #123000
2018-02-17 10:30:29,954 : INFO : PROGRESS: saving document #124000
2018-02-17 10:30:29,974 : INFO : PROGRESS: saving document #125000
2018-02-17 10:30:29,993 : INFO : PROGRESS: saving document #126000
2018-02-17 10:30:30,012 : INFO : PROGRESS: saving document #127000
2018-02-17 10:30:30,030 : INFO : PROGRESS: saving document #128000
2018-02-17 10:30:30,048 : INFO : PROGRESS: saving document #129000
2018-02-17 10:30:30,067 : INFO : PROGRESS: saving document #130000
2018-02-17 10:30:30,088 : INFO : PROGRESS: saving document #13

In [28]:
def evaluate_graph(dictionary, corpus, texts, limit):
    """
    Function to display num_topics - LDA graph using c_v coherence
    
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    limit : topic limit
    
    Returns:
    -------
    lm_list : List of LDA topic models
    c_v : Coherence values corresponding to the LDA model with respective number of topics
    """
    c_v = []
    lm_list = []
    for num_topics in range(1, limit):
        lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        lm_list.append(lm)
        cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary, coherence='c_v')
        c_v.append(cm.get_coherence())
        
    # Show graph
    x = range(1, limit)
    plt.plot(x, c_v)
    plt.xlabel("num_topics")
    plt.ylabel("Coherence score")
    plt.legend(("c_v"), loc='best')
    plt.show()
    
    return lm_list, c_v

In [38]:
!pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-2.1.2-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.2MB)
[K    100% |████████████████████████████████| 13.2MB 117kB/s eta 0:00:01
[?25hCollecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.10.0-py2.py3-none-any.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib)
  Downloading pyparsing-2.2.0-py2.py3-none-any.whl (56kB)
[K    100% |████████████████████████████████| 61kB 3.7MB/s ta 0:00:01
Installing collected packages: cycler, pyparsing, matplotlib
Successfully installed cycler-0.10.0 matplotlib-2.1.2 pyparsing-2.2.0


In [39]:
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel


In [None]:
## Runs for about 20 mins

In [103]:
%timeit lmlist, c_v = evaluate_graph(dictionary=dictionary, corpus=corpus, texts=vecs, limit=10)

NameError: name 'evaluate_graph' is not defined

In [104]:
import gensim.models.ldamodel 
#ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=150, iterations=50, alpha='asymmetric')

2018-02-20 11:06:51,216 : INFO : using asymmetric alpha [0.031139033, 0.028788464, 0.02676786, 0.025012296, 0.023472836, 0.02211189, 0.020900112, 0.019814245, 0.018835641, 0.017949153, 0.017142355, 0.01640497, 0.015728405, 0.015105434, 0.014529934, 0.013996676, 0.013501173, 0.0130395545, 0.012608458, 0.012204954, 0.011826476, 0.011470766, 0.011135828, 0.010819895, 0.010521393, 0.010238921, 0.009971219, 0.009717159, 0.009475723, 0.0092459945, 0.009027141, 0.008818409, 0.008619112, 0.008428624, 0.008246372, 0.008071837, 0.007904536, 0.0077440296, 0.0075899116, 0.0074418085, 0.0072993743, 0.0071622906, 0.0070302603, 0.00690301, 0.0067802845, 0.0066618463, 0.006547475, 0.006436964, 0.006330122, 0.0062267683, 0.0061267363, 0.0060298666, 0.005936013, 0.0058450364, 0.0057568057, 0.0056711994, 0.0055881017, 0.005507404, 0.005429004, 0.005352805, 0.005278715, 0.0052066483, 0.0051365225, 0.0050682607, 0.0050017894, 0.0049370397, 0.0048739444, 0.0048124413, 0.004752471, 0.0046939775, 0.004636906,

KeyboardInterrupt: 

In [74]:
model.save('082016lda.model')

2018-02-20 09:25:44,040 : INFO : saving LdaState object under 082016lda.model.state, separately None
2018-02-20 09:25:44,132 : INFO : saved 082016lda.model.state
2018-02-20 09:25:44,210 : INFO : saving LdaModel object under 082016lda.model, separately ['expElogbeta', 'sstats']
2018-02-20 09:25:44,211 : INFO : storing np array 'expElogbeta' to 082016lda.model.expElogbeta.npy
2018-02-20 09:25:44,229 : INFO : not storing attribute dispatcher
2018-02-20 09:25:44,229 : INFO : not storing attribute id2word
2018-02-20 09:25:44,231 : INFO : not storing attribute state
2018-02-20 09:25:44,236 : INFO : saved 082016lda.model


In [None]:
for i in range(0, model.num_topics):
    print(str(i),':',model.print_topic(i))

## Get Top Topic for Each Tweet

In future it would probably be best to have it return the list of topics with their respective adherences for each tweet; for now it is just the topic most adherent to each tweet.

In [75]:
#assign topics to tweets
doc_top_scores = []
for i in range(len(cleaned_texts)):
    doc_top_scores.append(model.get_document_topics(bow=dictionary.doc2bow(cleaned_texts[i])))

In [76]:
topic_scores = {}

for i in range(len(doc_top_scores)):
    topic_scores[i] = {}
    topics = [topic[0] for topic in doc_top_scores[i]]
    scores = [topic[1] for topic in doc_top_scores[i]]
    for topic_n in range(500):
        
        if topic_n in topics:
            topic_scores[i][topic_n] = scores[topics.index(topic_n)]

In [77]:
import pandas as pd

top_Score_df = pd.DataFrame.from_dict(topic_scores)
top_Score_df = top_Score_df.fillna(0)
top_Score_df = top_Score_df.transpose()

In [78]:
top_Score_df['text'] = list(kenya_geo_df.text)
top_Score_df['processed_text'] = list(strings)

In [81]:
import numpy as np

In [82]:
maxes = [] 
for row in range(top_Score_df.shape[0]):
    topic_adherence = list(top_Score_df.iloc[row,:top_Score_df.shape[1] - 2])
    max_score = topic_adherence.index(np.max(topic_adherence))
    maxes.append(max_score)

In [83]:
top_Score_df['max_topic'] = maxes
kenya_geo_df['max_topic'] = maxes

In [84]:
top_Score_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,144,145,146,147,148,149,text,processed_text,max_topic
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.250599,0.0,0.0,@KeshyRouzie Please use *140# and follow prom...,please use follow prompts caro,30
1,0.43302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,@IntelligentPix pics &amp; music tell it all #...,pics music tell cheki feat sharama makadem,0
2,0.01557,0.514394,0.013384,0.012506,0.011736,0.011056,0.01045,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Sadly PSG is taking all the 6points! https://t...,sadly psg taking points,1
3,0.01038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,@s_nakhone @alawiabdul stabbed his bf i hear that,stabbed bf hear,91
4,0.0,0.171465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,@De6rasse of Canada is the next big thing afte...,canada next big thing rio,1


In [56]:
print(kenya_geo_df.groupby('max_topic').count().sort_index(by=['text'],ascending=False)['text'].loc[[149]].sum())

  if __name__ == '__main__':


KeyError: 'None of [[149]] are in the [index]'

## Display Top 20 Tweets Per Topic

In [86]:
#remove duplicates so you get the most out of the top 20 tweets
kenya_tweet_df_no_dups = kenya_geo_df.drop_duplicates(subset='final_string')
print(kenya_tweet_df_no_dups.shape)

(97074, 7)


In [110]:
from IPython.display import display
from ipywidgets import widgets
from IPython.display import clear_output

text = widgets.Text()
display(text)

def handle_submit(sender):
    clear_output()
    print('Showing top 20 tweets in Topic',text.value)
    try:
        for t in kenya_tweet_df_no_dups.loc[kenya_tweet_df_no_dups.max_topic == int(text.value)].sample(frac=1)['text'][:70]:
            print(t)
            print()
    except KeyError:
        print('Invalid Topic Number (try anything from 0 to 199).')
    
text.on_submit(handle_submit)

Showing top 20 tweets in Topic 0
I've read interesting posts from @QulshTM and @IEAKwame on #traitor thing. The former was specifically illuminating. I will address that. 
3

"@MailSport: BREAKING: Roberto Martinez appointed Belgium manager. More to follow https://t.co/yTg2CiPrB3"

@benkonssojah shared @konshensnewgovz's post with you. See it at https://t.co/dI4nH1Oa71 mi real earthy dad

@jaytakeapic @kavi_is_me @KuiGitau @Nyaboe_ hae to you too LOOL

Listen in @nosimFm from 8pm today for an interactive session on launch of Suswa Lake Magadi ecosystem &amp;environmental restoration @MyGovKe

Met Saiid @javahouseafrica Agakhan walk A total stranger who gave me a tale on Pakistan and tea. Coffee isnt a drink it's a meet-up. A story

In 1498 he erected the Vasco da Gama Pillar in Malindi #TukutaneMSA2016 @MagicalKenya https://t.co/RoWJxLe0KL

@etaleJay different places. There is Chale island, Aberdares, Ngare Ndare forest. you should visit, it will be worth it

The RBI chief was, in part

In [2]:
kenya_tweet_df_no_dups[kenya_tweet_df_no_dups.text.str.contains('maternal')][['text', 'max_topic']]

NameError: name 'kenya_tweet_df_no_dups' is not defined

In [69]:
dictionary = corpora.Dictionary.load('~/repos/validate/data/model_persist/month01/082016.dict')
corpus = corpora.MmCorpus('~/repos/validate/data/model_persist/month01/082016.mm')
lda = LdaModel.load('082016lda')
#print dictionary
#print corpus
#print lda

2018-02-18 12:27:33,175 : INFO : loading Dictionary object from ~/repos/validate/data/model_persist/month01/082016.dict
2018-02-18 12:27:33,184 : INFO : loaded ~/repos/validate/data/model_persist/month01/082016.dict
2018-02-18 12:27:33,201 : INFO : loaded corpus index from ~/repos/validate/data/model_persist/month01/082016.mm.index
2018-02-18 12:27:33,201 : INFO : initializing corpus reader from ~/repos/validate/data/model_persist/month01/082016.mm
2018-02-18 12:27:33,203 : INFO : accepted corpus with 141895 documents, 23453 features, 998014 non-zero entries
2018-02-18 12:27:33,203 : INFO : loading LdaModel object from 082016lda
2018-02-18 12:27:33,206 : INFO : loading expElogbeta from 082016lda.expElogbeta.npy with mmap=None
2018-02-18 12:27:33,217 : INFO : setting ignored attribute dispatcher to None
2018-02-18 12:27:33,218 : INFO : setting ignored attribute id2word to None
2018-02-18 12:27:33,218 : INFO : setting ignored attribute state to None
2018-02-18 12:27:33,218 : INFO : loade

In [70]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [71]:
pyLDAvis.gensim.prepare(model, corpus, dictionary)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


In [90]:
#### moving manually to data folder
import re

save_text = list(kenya_geo_df.text)
save_text = [re.sub('\\n|\n|,|\s|\t', ' ', str(save_text[i])) for i in range(len(save_text))]
kenya_geo_df.text = save_text


kenya_geo_df.to_csv('~/repos/validate/data/model_persist/month01/August 2016.csv')
top_Score_df.to_csv('~/repos/validate/data/model_persist/month01/august2016_extended.csv')

In [1]:
import pandas as pd

kenya_geo_df = pd.read_csv('~/repos/validate/data/model_persist/month01/August 2016.csv', encoding='iso-8859-1')

In [88]:
kenya_geo_df[kenya_geo_df.text.str.contains('health')][['text', 'max_topic']]

Unnamed: 0,text,max_topic
297,health and health care: need 2 improve data fo...,74
347,Today @TheHubKaren will be Promoting healthy ...,30
466,The day you will see these rich people seeking...,139
489,Heartbreaking:Man hacks off his wife's hands b...,36
988,He is really going strong on her health detail...,0
1269,"#TransformingTourismKE healthy ecosystems, he...",40
1652,"They have a mission to deliver 1st-class, subs...",5
1751,@mohammedhersi @KideroEvans @HassanAliJoho @Ja...,14
2168,@TakedaPharma office launch. In attendance Hon...,74
2545,So much to be thankful for..gift of life...lov...,24


In [44]:
[text for text in kenya_geo_df['text'] if 'health' in text.lower() ]

['#Health just started trending with 46097 tweets. More trends at https://t.co/6AyDQDZQ89 #trndnl',
 'health and health care: need 2 improve data for cross-country and in-country analysis to determine need- Dr.Othieno Nyanjom #EAinequalities',
 'Today @TheHubKaren will be  Promoting healthy eating and responsible food production methods.Visit them  #RightEats https://t.co/tlm8A3ZWaq',
 'The day you will see these rich people seeking treatment in Kenyan hospitals is when you will know healthcare system has improved.',
 "Heartbreaking:Man hacks off his wife's hands because she failed to conceive. Tests showed she was fertile &amp; healthy. https://t.co/732BkZYlMH",
 'He is really going strong on her health details  could she be sick?  https://t.co/S0BmUHFQkV',
 '#TransformingTourismKE healthy  ecosystems  healthy communities are key for growth of nature based tourism. #StateHouseSummit @Min_TourismKE',
 'They have a mission to deliver 1st-class  subsidized healthcare to every corner on e

In [3]:
[i for i in kenya_geo_df.loc[i,'text'] if 'health' in i.lower()]

NameError: name 'i' is not defined

In [None]:
# remove duplicates so you get the most out of the top 20 tweets
# kenya_tweet_df_no_dups = top_Score_df.drop_duplicates(subset='processed_text')
lda_save_path = "./saved-lda-model"
ldaModel.save(lda_save_path) 

#moving manually to data folder
kenya_geo_df.to_csv('kenya_data_full_all.csv', encoding='utf-8')  

In [None]:
oup = open("topic_summary.txt", "wb")
for x in topics_final:
    oup.write("%s\n" % (x))
oup.close()

sc.stop()

In [None]:
#Free up some memory 
clear()