Tweet Classifier

This set of Python scripts is a real-time engine that loads news feeds from the Twitter channels of several well-known news agencies and identifies whether each item is important. The important news in each news-topic category is identified and pushed to users. The engine performs the following tasks:

  1. Collects and reads tweet data using the Twitter API and transforms it into CSV format.
  2. Trains the topic classifier on an external training dataset to predict the topic of each new news feed (query). The topics are technology, politics, sport, entertainment and business (supervised learning).
  3. Trains the hot classifier using pre-trained word2vec data to distinguish hot news from the rest (unsupervised learning).
  4. Every 10 minutes, downloads new news feeds, groups them into similar topics, and identifies whether they are important (a sketch of this loop follows the list).
  5. Identifies the best tweet for every topic.
  6. Pushes at most two real-time important (best) tweets to users each day.
  7. Stores the notified tweets.
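
The overall flow can be pictured as a simple polling loop. The sketch below is illustrative only: fetch_feeds, classify_topic and classify_heat are hypothetical stand-ins for the engine's components, not the repository's actual API.

    import time

    POLL_INTERVAL = 10 * 60  # the engine polls every 10 minutes (task 4)

    def fetch_feeds():
        # Stand-in for task 1: download new tweets via the Twitter API.
        return [{"text": "Example headline"}]

    def classify_topic(text):
        return "technology"  # stand-in for the supervised topic classifier (task 2)

    def classify_heat(text):
        return "hot"         # stand-in for the unsupervised hot classifier (task 3)

    def run_once():
        tweets = fetch_feeds()
        for t in tweets:
            t["topic"] = classify_topic(t["text"])
            t["heat"] = classify_heat(t["text"])
        # Tasks 5-7: keep one hot/breaking tweet per topic as a stand-in
        # for "best", and cap notifications at two per cycle.
        best = {t["topic"]: t for t in tweets if t["heat"] in ("hot", "breaking")}
        return list(best.values())[:2]

    if __name__ == "__main__":
        while True:
            print(run_once())
            time.sleep(POLL_INTERVAL)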

Dependencies

  • gensim
  • twitter (Twitter API client)
  • NumPy
  • SciPy
  • pandas
  • scikit-learn
  • NLTK
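
These can typically be installed with pip. The distribution names below are assumptions mapped from the list above (in particular, the Twitter API client may be the twitter or python-twitter package on PyPI):

    pip install gensim twitter numpy scipy pandas scikit-learn nltk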

How to use it

  1. Set the project root to the project folder directory.
  2. Extract classifiers\news_data.zip into the classifiers folder.
  3. Download the pre-trained word2vec data (vectors.text) from https://drive.google.com/file/d/0B21S11HaS5mxY0xfVnU3Q1VqNTA/view?usp=sharing and put it under the folder tweet_data\w2v_trained_data.
  4. Run the script main\app.py to start the engine and enjoy!

Classification algorithms

Topic classifier (supervised learning)

The classifier takes the tweets and places each of them into one of k topic classes. It is trained on a dataset of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005 [1].

  • Natural Classes: 5 (business, entertainment, politics, sport, tech)

In this project, the classifier uses a number of scikit-learn multi-class classifiers, such as those listed for the hot classifier below (Nearest Neighbors, Linear SVM, RBF SVM, and so on).
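
As an illustration, here is a minimal sketch of one such pipeline: TF-IDF features feeding a Linear SVM. The five inline documents are hypothetical stand-ins for the 2225-document BBC corpus, and Linear SVM is just one of the listed options:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Stand-in training data; the real engine trains on the BBC news corpus.
    texts = [
        "shares fell on the stock market today",
        "the striker scored twice in the final",
        "parliament passed the new budget bill",
        "the studio announced a sequel to the film",
        "the chip maker unveiled a faster processor",
    ]
    topics = ["business", "sport", "politics", "entertainment", "tech"]

    # TF-IDF features + a linear SVM, one of the multi-class options.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(texts, topics)

    print(model.predict(["the president met opposition leaders"]))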

Hot classifier (unsupervised learning)

The classifier takes tweets and identifies hot ones. For this purpose, it determines the importance of tweets based on their content and puts them into one of five classes: 'cold', 'normal', 'warm', 'hot' and 'breaking'. The class of a training tweet is derived from its numbers of favorites and retweets.

For each news agency account, the classifier first normalizes the numbers of favorites and retweets. A linear combination of these two normalized values,

    importance_score = alpha * norm(number_of_favorites) + (1 - alpha) * norm(number_of_retweets)

is then stored as importance_score. This score serves as a label indicating the importance of the tweet. The following thresholds are used to classify the training tweets (a scoring sketch appears after the list):

- 0.0 <= importance_score < 0.1  --------> 'cold'
- 0.1 <= importance_score < 0.4  --------> 'normal'
- 0.4 <= importance_score < 0.6  --------> 'warm'
- 0.6 <= importance_score < 0.8  --------> 'hot'
- 0.8 <= importance_score <= 1.0 --------> 'breaking'
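
A minimal sketch of the scoring and labelling, assuming min-max normalization per agency and alpha = 0.5 (both are assumptions; the repository may normalize and weight differently):

    def minmax(values):
        # Per-agency min-max normalization of raw counts to [0, 1].
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    def importance_scores(favorites, retweets, alpha=0.5):
        # importance_score = alpha * norm(favorites) + (1 - alpha) * norm(retweets)
        return [alpha * f + (1 - alpha) * r
                for f, r in zip(minmax(favorites), minmax(retweets))]

    def label(score):
        # The threshold bands above, applied as half-open intervals.
        if score < 0.1: return "cold"
        if score < 0.4: return "normal"
        if score < 0.6: return "warm"
        if score < 0.8: return "hot"
        return "breaking"

    favorites, retweets = [10, 300, 900], [5, 120, 800]
    for s in importance_scores(favorites, retweets):
        print(round(s, 2), label(s))  # cold, normal, breaking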

For the dumped tweets, the class and the corresponding score of each tweet are determined, and these tweets serve as labelled data. For each query (new tweet), the classifier calculates the distance between the query and the labelled tweets using the following methods:

  1. word2vec: Word2vec model. Each tweet contains a number of words. In this model, the classifier uses a pre-trained model (vectors.text, about 100 MB) to produce a word-embedding vector for each word. By averaging these vectors, an embedding vector representation of the tweet is obtained (see the sketch after this list). For this purpose, the classifier uses gensim (https://radimrehurek.com/gensim/) [2] for word embedding. After determining the tweet embedding vectors, the following classifiers are applied:
                ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process", "Decision Tree",
                "Random Forest", "Neural Net", "AdaBoost", "Naive Bayes", "QDA"]
    
  2. tfidf: TF-IDF model (more info: https://radimrehurek.com/gensim/models/tfidfmodel.html)
  3. lsi: Latent Semantic Indexing (more info: https://radimrehurek.com/gensim/models/lsimodel.html)
  4. rp: Random Projections (more info: https://radimrehurek.com/gensim/models/rpmodel.html)
  5. dp: Hierarchical Dirichlet Process (more info: https://radimrehurek.com/gensim/models/hdpmodel.html)
  6. lda: Latent Dirichlet Allocation (more info: https://radimrehurek.com/gensim/models/ldamodel.html)
  7. lem: LogEntropy model (more info: https://radimrehurek.com/gensim/models/logentropy_model.html)
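
For item 1, averaging the pre-trained word vectors of a tweet can be sketched as follows. The word2vec text format and the file path are assumptions based on the setup instructions above:

    import numpy as np
    from gensim.models import KeyedVectors

    # Load the pre-trained embeddings downloaded during setup.
    wv = KeyedVectors.load_word2vec_format(
        "tweet_data/w2v_trained_data/vectors.text", binary=False)

    def tweet_vector(text):
        # Average the embedding vectors of the tweet's in-vocabulary words.
        vecs = [wv[w] for w in text.lower().split() if w in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    query = tweet_vector("breaking news about the election")
    # These tweet vectors are what the classifiers listed in item 1 consume.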

For comments, updates and patch submissions, please contact:

Ali Nadaf, PhD
Data Scientist
<ali.nadaf@gmail.com>

References

  1. D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML, 2006.
  2. R. Řehůřek and P. Sojka. "Software Framework for Topic Modelling with Large Corpora", Proc. LREC Workshop on New Challenges for NLP Frameworks, 2010.
