# Twitter Opinion Mining


### Import modules and define filenames and directories for current job
All the parameters are saved in a dictionary named `job`
that is passed to the different modules
and can be saved to reproduce the results.
Optional parameters can be added to fine-tune
the process. Each optional parameter is explained at the corresponding step below.

In [3]:
# import the different modules
from buildDatabase import buildDatabse
from makeHTnetwork import makeHTNetwork
from selectInitialHashtags import selectInitialHashtags
from propagateLabels import propagateLabels
from addStatSigniHT import addStatSigniHT
from selectHashtags import selectHashtags
from updateHTGroups import updateHTGroups
from buildTrainingSet import buildTrainingSet
from crossValOptimize import crossValOptimize
from trainClassifier import trainClassifier
from classifyTweets import classifyTweets
from makeProbaDF import makeProbaDF
from analyzeProbaDF import analyzeProbaDF

# create empty dictionary and add parameters
job = {}
# list of directories containing the tweet archive files (TAJ)
job['tweet_archive_dirs'] = ['etrade']

# SQLite database that will be created
job['sqlite_db_filename'] = 'test.sqlite'

# hashtag co-occurrence graph that will be created
job['graph_file'] = 'graph_file.graphml'

# pickle file where the training set features will be saved
job['features_pickle_file'] = 'features.pickle'

# pickle file where the training set labels will be saved
job['labels_pickle_file'] = 'labels.pickle'

# vectorized features file
job['features_vect_file'] = 'features.mmap'

# vectorized labels file
job['labels_vect_file'] = 'labels.mmap'

# mapping between labels names and numbers
job['labels_mappers_file'] = 'labels_mappers.pickle'

# JSON file with the classifier best parameters obtained from cross-validation
job['best_params_file'] = 'best_params.json'

# where the trained classifier will be saved
job['classifier_filename'] = 'classifier.pickle'

# DataFrame with the results of the label propagation
# on the hashtag network
job['propag_results_filename'] = 'propag_results.pickle'

# DataFrame with the classification probability of
# every tweet in the database
job['df_proba_filename'] = 'df_proba.pickle'

# DataFrame with the number of tweets in each camp per day
job['df_num_tweets_filename'] = 'df_num_tweets.pickle'

# DataFrame with the number of users in each camp per day
job['df_num_users_filename'] = 'df_num_users.pickle'
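
Since `job` is a plain dictionary, one simple way to keep a record of a run is to dump it to a JSON file. The following is only a minimal sketch and is not part of the pipeline modules: the helper names `save_job` and `load_job` and the file name `job.json` are hypothetical. Note that values that are not JSON-serializable (for instance a NumPy array in `grid_search_parameters`) would need to be converted first.

```python
import json
import os
import tempfile


def save_job(job, path):
    """Write the job dictionary to a JSON file (hypothetical helper)."""
    with open(path, 'w') as f:
        json.dump(job, f, indent=2, sort_keys=True)


def load_job(path):
    """Read a job dictionary back from a JSON file (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


# toy dictionary with two of the parameters defined above
example_job = {'tweet_archive_dirs': ['etrade'],
               'sqlite_db_filename': 'test.sqlite'}

# 'job.json' is an arbitrary file name chosen for this example
job_path = os.path.join(tempfile.gettempdir(), 'job.json')
save_job(example_job, job_path)
restored_job = load_job(job_path)
```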

### 1. Build the SQLite database with the extracted info from the tweets
Read the tweets from all the .taj files in the directories `tweet_archive_dirs`
and add them to the database `sqlite_db_filename`.

In [4]:
buildDatabse(job).run()

0 over 1
... getting data from etrade/tweets-b15b7e5b-a99f-4612-a24d-c452dbc0b9fb.taj
... took 2.637s

*** updating sqlite tables...
*** took 0.5434s

Finished
Total time 3.18153s
Transaction time 3.2005s
Total time 3.21025s
sqlite_file : test.sqlite
Creating indexes
time 0.220614s


### 2.  Make the Hashtag co-occurrences network
Reads all the co-occurrences from the SQLite database and builds the network
in which nodes are hashtags and edges are co-occurrences.
The graph is a [*graph-tool*](https://graph-tool.skewed.de/) object and is saved in graphml format to `graph_file`.

Nodes of the graph have two properties: `counts` is the number of single occurrences of the hashtag and `name` is the name of the hashtag.

Edges have a property `weights` equal to the number of co-occurrences they represent.

The graph has the following properties saved with it:
- `Ntweets`: number of tweets with at least one hashtag used to build the graph.
- `start_date` : date of the first tweet.
- `stop_date` : date of the last tweet.
- `weight_threshold` : co-occurrence threshold. Edges with less than `weight_threshold` co-occurrences are discarded.

*Optional parameters that can be added to `job`:*
- `start_date` and `stop_date` to specify a time range for the tweets. (Default is `None`, i.e. select all the tweets in the database).
- `weight_threshold` is the minimum number of co-occurrences between two hashtags for their edge to be included in the graph. (Default is 3).

To add a parameter to job, simply execute `job["parameter name"] = parameter value`.
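
To illustrate what the edge weights and `weight_threshold` mean, here is a small stand-alone sketch that counts hashtag co-occurrences in a toy list of tweets. It does not read the SQLite database or use *graph-tool*. The function name `cooccurrence_edges` and the toy data are made up for this example, and the actual module may count differently (for instance when a hashtag is repeated within a tweet).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tweets_hashtags, weight_threshold=3):
    """Count hashtag occurrences and pairwise co-occurrences.

    tweets_hashtags: list of lists, the hashtags of each tweet.
    Returns (counts, edges) where edges only keeps the pairs seen
    together in at least `weight_threshold` tweets.
    """
    counts = Counter()
    weights = Counter()
    for hashtags in tweets_hashtags:
        # assumption: a hashtag is counted once per tweet
        unique = sorted(set(hashtags))
        counts.update(unique)
        # each unordered pair of hashtags in the tweet is one co-occurrence
        weights.update(combinations(unique, 2))
    edges = {pair: w for pair, w in weights.items() if w >= weight_threshold}
    return counts, edges


toy_tweets = [['money', 'stocks'],
              ['money', 'stocks'],
              ['stocks', 'money', 'cash'],
              ['cash']]
counts, edges = cooccurrence_edges(toy_tweets, weight_threshold=3)
```

With these toy tweets, only the pair `('money', 'stocks')` reaches three co-occurrences, so it is the only edge kept.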


In [5]:
makeHTNetwork(job).run()

creating edge list
*** took 1.968s
creating graph
*** took 0.005866s


### 3. Add statistical significance value to edges
Adds a property `s` to the edges of the graph corresponding to the statistical significance (`s = log10(p0/p)`, where `p` is the p-value of the co-occurrence)
computed from a null model [1].
The computation is done using `p0=1e-6` and `p0` is saved as a graph property.
Different values of `p0` can be tested later.
The resulting graph is saved to `graph_file`.

*Optional parameters that can be added to `job`:*
- `ncpu` : number of processors to be used. (Default is the number of cores on your machine minus 1).


[1] Martinez-Romo, J. et al. Disentangling categorical relationships through a graph of co-occurrences. Phys. Rev. E 84, 1–8 (2011).
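
As an illustration of the formula `s = log10(p0/p)`, the sketch below computes a p-value for a pair of hashtags and converts it to a significance. It assumes that the null model is hypergeometric, i.e. that the two hashtags are placed independently at random among the tweets. This is an assumption made for the example, so please check [1] and the source of `addStatSigniHT` for the exact definition. The function and argument names are hypothetical.

```python
from math import comb, log10


def cooccurrence_p_value(n_tweets, count_a, count_b, n_cooc):
    """Probability of observing at least `n_cooc` co-occurrences by chance.

    Assumed null model: hashtag A appears in `count_a` of the `n_tweets`
    tweets and hashtag B in `count_b` of them, placed independently at
    random (hypergeometric distribution).
    """
    total = comb(n_tweets, count_b)
    k_max = min(count_a, count_b)
    tail = sum(comb(count_a, k) * comb(n_tweets - count_a, count_b - k)
               for k in range(n_cooc, k_max + 1))
    return tail / total


def significance(p, p0=1e-6):
    """Significance s = log10(p0 / p); s > 0 means p is smaller than p0."""
    return log10(p0 / p)


# toy example: 10 tweets, each hashtag used 3 times, always together
p = cooccurrence_p_value(n_tweets=10, count_a=3, count_b=3, n_cooc=3)
s = significance(p)
```

In this toy case `p` is 1/120, which is much larger than `p0`, so `s` is negative and the edge would not be considered significant.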


In [6]:
addStatSigniHT(job).run()

computing significance of links


[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    9.7s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:   44.5s


finished
*** took 57.88s


[Parallel(n_jobs=3)]: Done 351 out of 351 | elapsed:   57.7s finished


### 4. Select the initial hashtags to start the propagation
This will display the top occurring hashtags.

*Optional parameters that can be added to `job`:*
- `num_top_htgs` : number of top hashtags to be displayed. (Default is 100).

In [7]:
selectInitialHashtags(job).run()

 Top 100 occuring hashtags:
* rank: (name: frequency)
0 :('finance', 8162)
1 :('etrade', 8028)
2 :('tradeking', 7663)
3 :('money', 7436)
4 :('stock', 4169)
5 :('401k', 4141)
6 :('alerts', 4141)
7 :('amtd', 4141)
8 :('stocks', 3937)
9 :('stockmarket', 3915)
10 :('cash', 3911)
11 :('market', 3523)
12 :('ameritrade', 3522)
13 :('scottrade', 3522)
14 :('mortgage', 714)
15 :('rates', 352)
16 :('didyouknow', 210)
17 :('loan', 94)
18 :('loans', 76)
19 :('history', 70)
20 :('interest', 70)
21 :('seattle', 70)
22 :('canadian', 68)
23 :('kia', 58)
24 :('motors', 58)
25 :('hiring', 49)
26 :('year', 47)
27 :('news', 45)
28 :('house', 45)
29 :('a', 44)
30 :('get', 44)
31 :('major', 40)
32 :('rate', 39)
33 :('refinancing', 38)
34 :('boycottcnn', 36)
35 :('calculator', 35)
36 :('lenders', 35)
37 :('daytrading', 32)
38 :('stocktrader', 31)
39 :('vermont', 30)
40 :('trump', 30)
41 :('2nd', 29)
42 :('mortgages', 29)
43 :('estimate', 29)
44 :('calcu', 27)
45 :('ecommerce', 26)
46 :('jobsearch', 26)
47 :(

Select the seed hashtags you want to use from the list (minimum two)
and add them to the `job` dictionary with the key `initial_htgs_lists`:

In [8]:
# initial_htgs_lists is a list of lists with the hashtag seeds for each camp:
job['initial_htgs_lists'] = [['money'],
                             ['401k']]

### 5. Propagate labels to neighboring hashtags
This part can be looped by updating the `htgs_lists` in `job` with the result of the label propagation to reach a larger number of hashtags.

In [9]:
# start with the hashtag seeds selected above.
job['htgs_lists'] = job['initial_htgs_lists']

The loop has two steps:
1. `propagateLabels` uses the graph from `graph_file` and the initial hashtags from `htgs_lists` to propagate their labels to their neighbors taking into account the statistical significance of edges. The results are saved in a pandas DataFrame in `propag_results_filename`.
    - *Optional parameters that can be added to `job`:*
        - `count_ratio` : threshold, $r$, for removing hashtags with a number of single occurrences smaller than $r \max\limits_{v_j\in C_k} c_j$, where $c_i$ is the number of occurrences of the hashtag associated with vertex $v_i$ and $C_k$ is the class to which $v_i$ belongs. (Default = 0.001).
        - `p0` : significance threshold; only edges with p_val <= p0 are kept. (Default = 1e-5).

2. Visualisation of the results using `selectHashtags`, and updating the `htgs_lists` list. This will print a list of hashtags, $i$, for each camp $C_k$ satisfying: $\sum_{j \in C_k} s_{ij} > \sum_{j \in C_l} s_{ij}$, where $s_{ij}$ is the significance of the edge between hashtags $i$ and $j$ and $C_l$ represents all the camps other than $C_k$.
    - *Optional parameters that can be added to `job`:*
        - `num_top_htgs` : number of top hashtags to be displayed in each camp. (Default is 100).
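
The two rules described in the loop above (the `count_ratio` filter and the comparison of summed significances) can be sketched in a few lines of plain Python. This toy example is not the implementation of `propagateLabels` or `selectHashtags`: the function names and the data layout (a dictionary mapping each hashtag to a list of summed significances, one per camp) are assumptions made for illustration.

```python
def filter_by_count(counts, camp_members, count_ratio=0.001):
    """Keep hashtags whose count is at least count_ratio * (max count in the camp)."""
    max_count = max(counts[h] for h in camp_members)
    return [h for h in camp_members if counts[h] >= count_ratio * max_count]


def assign_camps(signi_sums):
    """Assign a hashtag to camp k if its summed significance towards camp k
    is larger than its summed significance towards all the other camps.

    signi_sums: {hashtag: [sum of s_ij over camp 0, sum over camp 1, ...]}
    Returns {hashtag: camp index}; hashtags with no dominant camp are left out.
    """
    assigned = {}
    for hashtag, sums in signi_sums.items():
        total = sum(sums)
        for k, s_k in enumerate(sums):
            if s_k > total - s_k:
                assigned[hashtag] = k
                break
    return assigned


# toy values, not taken from a real run
toy_counts = {'finance': 8162, 'kia': 58, 'rare': 5}
kept = filter_by_count(toy_counts, ['finance', 'kia', 'rare'])

toy_sums = {'stocks': [12.5, 3.0], 'kia': [0.0, 7.7], 'news': [2.0, 2.0]}
camps = assign_camps(toy_sums)
```

Here `'rare'` is dropped because 5 is below 0.001 × 8162, and `'news'` is not assigned to any camp because neither sum dominates.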

In [18]:
# 1st step of the loop:
propagateLabels(job).run()

Propagating labels
saving results


In [19]:
# 2nd step of the loop:
selectHashtags(job).run()
# the meanings of the displayed columns are:
# count (= total number of occurrences),
# label_init (= initial label before propagation, -1 means no initial labels)
# vertex_id  (= ID of the vertex in the hashtag graph)
# label_sum1 (= number of neighbors with label 1)
# signi_sum1 (= sum of the significance of edges with neighbors having label 1)
# label_sum2 (= number of neighbors with label 2)
# signi_sum2 (= sum of the significance of edges with neighbors having label 2)


 +++ hashtags in camp 2
            name  count  label_init  vertex_id  label_sum2    signi_sum2  \
26       finance   8162           2         26         4.0   4527.106262   
1         etrade   8028           2          1         5.0   8079.589318   
81     tradeking   7663           2         81         5.0   8517.992058   
24         stock   4169           2         24         6.0  14079.147091   
98          401k   4141           2         98         6.0  14382.367984   
99        alerts   4141           2         99         6.0  14382.367984   
100         amtd   4141           2        100         6.0  14382.367984   
76           kia     58          -1         76         1.0      7.655783   
77        motors     58          -1         77         1.0      7.655783   
27         major     40          -1         27         1.0      3.720529   
0     daytrading     32          -1          0         1.0      0.846667   
2    stocktrader     31          -1          2         1.0     

In [14]:
# you can now update the hashtag list and return to the 1st step.
job['htgs_lists'] = [['money', 'stocks', 'stockmarket', 'cash', 'market', 'ameritrade', 'scottrade'],
                     ['401k', 'finance', 'etrade', 'tradeking', 'stock', 'amtd', 'alerts']]

### 6.  Mark the selected hashtags in the database and build the training set
`updateHTGroups` takes the lists of hashtags `htgs_lists` and marks them in the database `sqlite_db_filename`.

*Optional parameters that can be added to `job`:*
- `column_name_ht_group` : name of the column added to the database (Default is `'ht_class'`). Different names can be used to test different `htgs_lists`.

In [20]:
updateHTGroups(job).run()

*** took 0.1708s


`buildTrainingSet` reads the tweets from the database that contain the hashtags marked above, extracts the features and label of each tweet and saves them in `features_pickle_file` and `labels_pickle_file`, respectively.
Vectorized versions of the features and labels are saved to `features_vect_file` and `labels_vect_file` for the cross-validation. A mapper between label names and label numbers is saved to `labels_mappers_file`.

*Optional parameters:*

- If the optional parameter `column_name_ht_group` has been changed in `job` in the step before, it will be used here to select the corresponding hashtag lists.
- `undersample_maj_class` : whether to undersample the majority class in order to balance the training set. Default is `True`; if `False`, the unbalanced training set will be used and the [class weight](http://scikit-learn.org/0.18/modules/generated/sklearn.linear_model.SGDClassifier.html) will be adjusted accordingly during training.
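
The idea behind `undersample_maj_class` can be sketched as follows: every class is randomly reduced to the size of the smallest one. This is a generic illustration and not the code of `buildTrainingSet`; the function name, the `seed` argument and the toy data are assumptions.

```python
import random


def undersample_majority(samples_by_class, seed=0):
    """Randomly reduce every class to the size of the smallest class.

    samples_by_class: {label: list of samples}
    Returns a new dictionary with the same labels and balanced lists.
    """
    rng = random.Random(seed)
    n_min = min(len(samples) for samples in samples_by_class.values())
    return {label: rng.sample(samples, n_min)
            for label, samples in samples_by_class.items()}


# toy training set with 5 tweets in class 1 and 80 tweets in class 2
toy_set = {1: ['tweet %d' % i for i in range(5)],
           2: ['tweet %d' % i for i in range(5, 85)]}
balanced = undersample_majority(toy_set)
n_samples = sum(len(samples) for samples in balanced.values())
```

With 5 and 80 tweets in the two classes, the balanced training set contains 10 samples in total.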

In [21]:
buildTrainingSet(job).run()

Num tweets 1: 5
Num tweets 2: 80
Balancing sets by undersampling the majority class

Vectorizing features
*** took 0.0009975s
Num samples x Num features
(10, 284)


### 7. Cross-Validation
Optimize classifier parameters with cross-validation. `crossValOptimize` loads the vectorized features and labels (`features_vect_file` and `labels_vect_file`) and saves the results of the optimization to `best_params_file` in JSON format.

*Optional parameters:*
- if `undersample_maj_class` was set to `False` when building the training set, class weights will be adjusted to take into account different sizes of classes.
- `ncpu` : number of cores to use (default is the number of cpus on your machine minus one).
- `scoring` : The score used to optimize (default is `'f1_micro'`). See the [documentation](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.GridSearchCV.html) for explanation and other possibilities. 
- `n_splits` : number of [folds](http://scikit-learn.org/0.18/modules/generated/sklearn.model_selection.KFold.html) (default is 10).
- `loss` : [loss function](http://scikit-learn.org/0.18/modules/generated/sklearn.linear_model.SGDClassifier.html) to be used. Default is `'log'` for Logistic Regression.
- `penalty` : [penalty](http://scikit-learn.org/0.18/modules/generated/sklearn.linear_model.SGDClassifier.html) of the regularization term (default is `'l2'`).
- `n_iter` : [number of iterations](http://scikit-learn.org/0.18/modules/generated/sklearn.linear_model.SGDClassifier.html) of the gradient descent algorithm. Default is `5e5/(number of training samples)`. See the sklearn Stochastic Gradient Descent [user guide](http://scikit-learn.org/0.18/modules/sgd.html#sgd) for recommended settings.
- `grid_search_parameters` : parameter space to explore during the cross-validation. Default is `{'classifier__alpha' : np.logspace(-1,-7, num=20)}`, i.e. optimizing the [regularization strength](http://scikit-learn.org/0.18/modules/sgd.html#sgd) (`alpha`) between 1e-1 and 1e-7 with 20 logarithmic steps.
- `verbose` : verbosity level of the classifier (default is 1).
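
To make the two numerical defaults above more concrete, the following sketch reproduces them in plain Python: the 20 logarithmically spaced values of `alpha` (the same values as `np.logspace(-1, -7, num=20)`) and the default number of iterations. The function names are hypothetical, and whether the module rounds the default `n_iter` is not specified here, so the value is left as a float.

```python
def default_alpha_grid(start_exp=-1, stop_exp=-7, num=20):
    """Values 10**e for `num` exponents evenly spaced between the two bounds."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


def default_n_iter(n_samples):
    """Default number of iterations as described above: 5e5 / n_samples."""
    return 5e5 / n_samples


alphas = default_alpha_grid()

# with a training set of 10 samples
n_iter = default_n_iter(10)
```

The second value of the grid is about 0.0483, which matches the grid printed by the grid search. A custom grid would be passed through the key `grid_search_parameters` of `job`, using the `'classifier__alpha'` name shown in the default above.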

In [22]:
# here we set n_iter=2 just for testing purposes
job['n_iter'] = 2
crossValOptimize(job).run()


Performing grid search...
pipeline: ['classifier']
parameters:
{'classifier__alpha': array([  1.00000000e-01,   4.83293024e-02,   2.33572147e-02,
         1.12883789e-02,   5.45559478e-03,   2.63665090e-03,
         1.27427499e-03,   6.15848211e-04,   2.97635144e-04,
         1.43844989e-04,   6.95192796e-05,   3.35981829e-05,
         1.62377674e-05,   7.84759970e-06,   3.79269019e-06,
         1.83298071e-06,   8.85866790e-07,   4.28133240e-07,
         2.06913808e-07,   1.00000000e-07])}
-- Epoch 1
-- Epoch 1
-- Epoch 1
Norm: 6.14, NNZs: 265, Bias: -0.002968, T: 9, Avg. loss: 0.666463
Norm: 7.11, NNZs: 259, Bias: 0.001452, T: 9, Avg. loss: 0.966663
Total training time: 0.01 seconds.
Norm: 5.89, NNZs: 243, Bias: -0.003480, T: 9, Avg. loss: 0.658272
Total training time: 0.01 seconds.
-- Epoch 2
Total training time: 0.01 seconds.
-- Epoch 2
-- Epoch 2
Norm: 4.29, NNZs: 259, Bias: 0.001467, T: 18, Avg. loss: 0.484026
Norm: 3.72, NNZs: 265, Bias: -0.003215, T: 18, Avg. loss: 0.336048
To

Total training time: 0.01 seconds.
Norm: 9.55, NNZs: 250, Bias: -0.001987, T: 9, Avg. loss: 0.831998
Total training time: 0.01 seconds.
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 18.34, NNZs: 262, Bias: -0.029452, T: 18, Avg. loss: 0.525675
Total training time: 0.01 seconds.
-- Epoch 2
Norm: 6.36, NNZs: 250, Bias: -0.002863, T: 18, Avg. loss: 0.421357
Total training time: 0.02 seconds.
-- Epoch 1
-- Epoch 1
-- Epoch 1
Norm: 18.23, NNZs: 242, Bias: 0.012647, T: 9, Avg. loss: 0.575099
Total training time: 0.01 seconds.
Norm: 9.54, NNZs: 267, Bias: 0.003052, T: 9, Avg. loss: 0.680207
-- Epoch 2
Norm: 26.99, NNZs: 262, Bias: -0.049064, T: 9, Avg. loss: 1.031940
Norm: 14.66, NNZs: 242, Bias: 0.012444, T: 18, Avg. loss: 0.288121
Total training time: 0.02 seconds.
-- Epoch 1
Total training time: 0.03 seconds.
Norm: 18.24, NNZs: 254, Bias: 0.002775, T: 9, Avg. loss: 0.584686
Total training time: 0.01 seconds.
Total training time: 0.00 seconds.
-- Epoch 2
-- Epoch 2
Norm: 23.36, NNZs: 

-- Epoch 2
-- Epoch 1
Norm: 40.82, NNZs: 254, Bias: 0.009505, T: 18, Avg. loss: 0.468781
-- Epoch 1
Norm: 48.92, NNZs: 254, Bias: -0.062876, T: 9, Avg. loss: 0.693534
Total training time: 0.00 seconds.
Total training time: 0.03 seconds.
-- Epoch 2
Norm: 66.75, NNZs: 262, Bias: 0.110987, T: 9, Avg. loss: 0.821840
-- Epoch 1
Norm: 47.31, NNZs: 254, Bias: -0.062754, T: 18, Avg. loss: 0.346892
Total training time: 0.01 seconds.
Norm: 31.16, NNZs: 264, Bias: 0.032668, T: 9, Avg. loss: 0.455659
Total training time: 0.00 seconds.
Total training time: 0.01 seconds.
-- Epoch 1
-- Epoch 2
Norm: 56.55, NNZs: 264, Bias: -0.026204, T: 9, Avg. loss: 3.019843
Norm: 37.38, NNZs: 264, Bias: -0.015110, T: 18, Avg. loss: 0.555482
Total training time: 0.00 seconds.
Total training time: 0.01 seconds.
-- Epoch 2
Norm: 54.68, NNZs: 264, Bias: -0.026196, T: 18, Avg. loss: 1.509929
-- Epoch 2
Norm: 68.67, NNZs: 262, Bias: 0.067387, T: 18, Avg. loss: 0.460914
-- Epoch 1
Total training time: 0.01 seconds.
Norm: 

Norm: 130.12, NNZs: 259, Bias: -0.128563, T: 9, Avg. loss: 7.455268
-- Epoch 2
Norm: 97.30, NNZs: 262, Bias: 0.039543, T: 18, Avg. loss: 0.253349
Total training time: 0.00 seconds.
Total training time: 0.02 seconds.
Total training time: 0.01 seconds.
-- Epoch 2
Norm: 129.61, NNZs: 259, Bias: -0.128469, T: 18, Avg. loss: 3.727772
-- Epoch 2
-- Epoch 1
Total training time: 0.02 seconds.
Norm: 124.08, NNZs: 243, Bias: 0.000869, T: 18, Avg. loss: 3.721101
Norm: 71.64, NNZs: 242, Bias: 0.131951, T: 9, Avg. loss: 0.311336
Total training time: 0.01 seconds.
-- Epoch 1
Total training time: 0.01 seconds.
Norm: 122.12, NNZs: 265, Bias: -0.065038, T: 9, Avg. loss: 1.735200
-- Epoch 1
Total training time: 0.01 seconds.
Norm: 80.89, NNZs: 264, Bias: 0.001886, T: 9, Avg. loss: 0.918565
-- Epoch 2
Total training time: 0.00 seconds.
-- Epoch 2
-- Epoch 2
Norm: 80.35, NNZs: 264, Bias: 0.001161, T: 18, Avg. loss: 0.459660
Norm: 96.62, NNZs: 242, Bias: 0.001469, T: 18, Avg. loss: 0.497339
Norm: 148.82, N

-- Epoch 1
Total training time: 0.01 seconds.
Norm: 213.84, NNZs: 267, Bias: 0.188951, T: 18, Avg. loss: 3.287216
Norm: 274.32, NNZs: 254, Bias: -0.033392, T: 9, Avg. loss: 6.509579
Total training time: 0.02 seconds.
Total training time: 0.01 seconds.
Total training time: 0.02 seconds.
-- Epoch 1
-- Epoch 2
Norm: 224.16, NNZs: 262, Bias: -0.434443, T: 9, Avg. loss: 4.249931
Norm: 289.59, NNZs: 254, Bias: -0.217270, T: 18, Avg. loss: 3.347748
-- Epoch 2
Norm: 287.86, NNZs: 265, Bias: -0.220030, T: 18, Avg. loss: 6.359521
Total training time: 0.00 seconds.
Total training time: 0.02 seconds.
-- Epoch 2
-- Epoch 1
Norm: 285.86, NNZs: 262, Bias: -0.162830, T: 18, Avg. loss: 3.100556
Norm: 187.66, NNZs: 264, Bias: -0.006380, T: 9, Avg. loss: 5.148465
Total training time: 0.05 seconds.
Total training time: 0.01 seconds.
-- Epoch 1
-- Epoch 1
Norm: 270.93, NNZs: 242, Bias: 0.162772, T: 9, Avg. loss: 4.802053
Total training time: 0.01 seconds.
-- Epoch 2
Norm: 222.32, NNZs: 243, Bias: 0.024092,

Total training time: 0.01 seconds.
Norm: 524.80, NNZs: 265, Bias: 0.210280, T: 9, Avg. loss: 11.243918
-- Epoch 1
Total training time: 0.01 seconds.
-- Epoch 2
-- Epoch 1
Norm: 288.46, NNZs: 243, Bias: -0.299863, T: 9, Avg. loss: 1.202031
Norm: 557.49, NNZs: 265, Bias: -0.258531, T: 18, Avg. loss: 11.993016
Total training time: 0.01 seconds.
Total training time: 0.02 seconds.
-- Epoch 2
Norm: 294.60, NNZs: 254, Bias: 0.260252, T: 9, Avg. loss: 2.871493
Total training time: 0.01 seconds.
Total training time: 0.01 seconds.
-- Epoch 2
Norm: 359.81, NNZs: 243, Bias: 0.091000, T: 18, Avg. loss: 3.550281
Norm: 294.57, NNZs: 254, Bias: 0.260252, T: 18, Avg. loss: 1.435747
-- Epoch 1
Total training time: 0.01 seconds.
-- Epoch 1
Norm: 342.85, NNZs: 264, Bias: -0.314069, T: 9, Avg. loss: 2.807212
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 342.79, NNZs: 264, Bias: -0.314069, T: 18, Avg. loss: 1.403606
Total training time: 0.01 seconds.
-- Epoch 1
Norm: 426.55, NNZs: 243, Bias: -0.600112

### 8. Train Classifier
Uses features and labels from `features_pickle_file` and `labels_pickle_file` to train the classifier using the parameters from `best_params_file`. The trained classifier is then saved to `classifier_filename`.

In [23]:
trainClassifier(job).run()


Training classifier with the following parameters:
    loss         = log
    alpha        = 0.0054555948000000005
    n_iter       = 2
    penalty      = l2
    class_weight = None

fitting classifier
*** took 0.00315s


### 9. Classify the tweets
Adds two tables, `class_proba` and `retweet_class_proba`, to the SQLite database with the results of the classification of each tweet and of each original retweeted status.

*Optional parameters:*
- `propa_table_name_suffix` : add a suffix to the two table names in order to compare different classifiers. Default is '' (empty string).

In [27]:
classifyTweets(job).run()

loading classifier.pickle
Table retweet_class_proba already exists in database.
Do you want to drop the table (irreversibly delete it) and replace it? (y/n)y
0 over 0.0307
** row : 0 to 9999

total time : 0.0003566741943359375
getting tweets from retweeted_status
updating retweet_class_proba
finished
*** took 0.05206s
Table class_proba already exists in database.
Do you want to drop the table (irreversibly delete it) and replace it? (y/n)y
0 over 1.6071
** row : 0 to 9999

total time : 0.0002892017364501953
getting tweets from tweet
updating class_proba
1 over 1.6071
** row : 10000 to 19999

total time : 1.2228515148162842
getting tweets from tweet
updating class_proba
finished
*** took 2.017s


### 10. Analyze classification results
`makeProbaDF` reads the classification results from the database and processes them to:
- Replace the classification probability of retweets with the classification results of the original tweets.
- Replace the classification probability of tweets having a hashtag of one of the two camps (and not of the other camp) with 0 (for camp1) or 1 (for camp2).
- Discard tweets emanating from unofficial Twitter clients.

The results are saved as a pandas dataframe in `df_proba_filename`.
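
The three corrections listed above can be sketched with plain dictionaries. This is not the code of `makeProbaDF`, which works on pandas dataframes read from the database. The data layout, the key names and the order in which the corrections are applied are assumptions made for this illustration.

```python
def correct_probabilities(tweets):
    """Apply the three corrections to a toy collection of tweets.

    tweets: {tweet_id: {'proba': float,
                        'retweet_of': tweet_id or None,
                        'camp_hashtags': set of camp indices (0 = camp1, 1 = camp2),
                        'official_client': bool}}
    Returns {tweet_id: corrected probability}.
    """
    corrected = {}
    for tweet_id, tweet in tweets.items():
        # discard tweets posted from unofficial clients
        if not tweet['official_client']:
            continue
        # a retweet takes the probability of the original tweet
        # (falls back to the tweet itself if it is not a retweet)
        original = tweets.get(tweet['retweet_of'], tweet)
        proba = original['proba']
        # a hashtag of exactly one camp overrides the classifier
        if tweet['camp_hashtags'] == {0}:
            proba = 0.0
        elif tweet['camp_hashtags'] == {1}:
            proba = 1.0
        corrected[tweet_id] = proba
    return corrected


toy_tweets = {
    1: {'proba': 0.8, 'retweet_of': None, 'camp_hashtags': set(), 'official_client': True},
    2: {'proba': 0.3, 'retweet_of': 1, 'camp_hashtags': set(), 'official_client': True},
    3: {'proba': 0.6, 'retweet_of': None, 'camp_hashtags': {0}, 'official_client': True},
    4: {'proba': 0.9, 'retweet_of': None, 'camp_hashtags': {0, 1}, 'official_client': True},
    5: {'proba': 0.4, 'retweet_of': None, 'camp_hashtags': {1}, 'official_client': False},
}
corrected = correct_probabilities(toy_tweets)
```

In this toy example, tweet 2 inherits the probability of tweet 1, tweet 3 is forced to 0, tweet 4 keeps its probability because it has hashtags of both camps, and tweet 5 is discarded.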

*Optional parameters:*
- `use_official_clients` : whether you want to keep only tweets from official clients (`True`) or all tweets (`False`). Default is `True`.
- `propa_table_name_suffix` can be changed to use the classification of different classifiers if it was used with `classifyTweets`.
- `column_name_ht_group` is also used if it was changed to create a different training set.


In [28]:
makeProbaDF(job).run()

querying sql
creating df proba
creating df_proba_original_rt
*** took 0.1104s
creating df_proba_rt
*** took 0.1791s
creating df_proba_ht_pro_0
*** took 0.1992s
creating df_proba_ht_pro_1
*** took 0.2313s
saving corrected dataframe
done
*** took 0.2584s


`analyzeProbaDF` reads `df_proba_filename` and returns the number of tweets and the number of users in each camp per day. The results are displayed and saved as pandas dataframes to `df_num_tweets_filename` and `df_num_users_filename`.

*Optional parameters:*
- `ncpu` : number of cores to use. Default is number of cores of the machine minus one.
- `resampling_frequency` : frequency at which tweets are grouped. Default is `'D'`, i.e. daily. (see [here](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for different possibilities.)
- `threshold` : threshold for the classifier probability (threshold >= 0.5). Tweets with p > threshold are classified in camp2 and tweets with p < 1-threshold are classified in camp1. Default is 0.5.
- `r_threshold` : threshold for the ratio of classified tweets needed to classify a user. Default is 0.5.
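
The role of `threshold` and `r_threshold` can be illustrated with the sketch below. The tweet rule follows the description of `threshold` given above. The user rule is only one possible reading of `r_threshold` (a user is assigned to a camp when the fraction of their tweets classified in that camp exceeds `r_threshold`) and is an assumption, as are the function names; the actual logic of `analyzeProbaDF` may differ.

```python
def classify_tweet(p, threshold=0.5):
    """Return 'camp2' if p > threshold, 'camp1' if p < 1 - threshold, else None."""
    if p > threshold:
        return 'camp2'
    if p < 1 - threshold:
        return 'camp1'
    return None


def classify_user(tweet_probas, threshold=0.5, r_threshold=0.5):
    """Assign a user to a camp from the probabilities of their tweets.

    Assumed rule: the user belongs to a camp if more than `r_threshold`
    of their tweets are classified in that camp; otherwise None.
    """
    if not tweet_probas:
        return None
    labels = [classify_tweet(p, threshold) for p in tweet_probas]
    for camp in ('camp1', 'camp2'):
        if labels.count(camp) / len(labels) > r_threshold:
            return camp
    return None


# a stricter threshold leaves uncertain tweets unclassified
label_high = classify_tweet(0.9, threshold=0.7)
label_mid = classify_tweet(0.6, threshold=0.7)
label_low = classify_tweet(0.1, threshold=0.7)

user_a = classify_user([0.9, 0.8, 0.2])
user_b = classify_user([0.9, 0.1])
```

With a `threshold` of 0.7, a tweet with probability 0.6 falls between the two cut-offs and is not counted in either camp.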

In [29]:
analyzeProbaDF(job).run()

loading df_proba.pickle
threshold: 0.5
r_threshold: 0.5
computing stats
finished
1.1803758144378662

Number of tweets per day in each camp:
            n_pro_0  n_pro_1
2016-10-20        2        8
2016-10-21       10       62
2016-10-22        8       18
2016-10-24       98       63
2016-10-25       46       50
2016-10-26        6      135
2016-10-27        7       28
2016-10-28        6       29
2016-10-29        3        7
2016-10-30       46       40
2016-10-31       54       82
2016-11-01       12       19
2016-11-02        9       34
2016-11-03        6       23
2016-11-04       12       30
2016-11-05        5       19
2016-11-06        1       26
2016-11-07        6       34
2016-11-08       12       38
2016-11-09      181      283
2016-11-10       43      163
2016-11-11        5       26
2016-11-12       10       16
2016-11-13        1       26
2016-11-14        5       24
2016-11-15       30       52
2016-11-16       36       89
2016-11-17       84     3342
2016-11-18       75

In [30]:
# print all job parameters
job

{'best_params_file': 'best_params.json',
 'classifier_filename': 'classifier.pickle',
 'df_num_tweets_filename': 'df_num_tweets.pickle',
 'df_num_users_filename': 'df_num_users.pickle',
 'df_proba_filename': 'df_proba.pickle',
 'features_pickle_file': 'features.pickle',
 'features_vect_file': 'features.mmap',
 'graph_file': 'graph_file.graphml',
 'htgs_lists': [['money',
   'stocks',
   'stockmarket',
   'cash',
   'market',
   'ameritrade',
   'scottrade'],
  ['401k', 'finance', 'etrade', 'tradeking', 'stock', 'amtd', 'alerts']],
 'initial_htgs_lists': [['money'], ['401k']],
 'labels_mappers_file': 'labels_mappers.pickle',
 'labels_pickle_file': 'labels.pickle',
 'labels_vect_file': 'labels.mmap',
 'n_iter': 2,
 'propag_results_filename': 'propag_results.pickle',
 'sqlite_db_filename': 'test.sqlite',
 'tweet_archive_dirs': ['etrade']}