To run this code, you also need to install the following modules:
- pymongo
- tweepy
- textblob
- textblob_fr

They should normally all be available through the command  
`pip install module_name`
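
As a quick sanity check before running the notebook, the sketch below reports which of the four modules listed above cannot be found by the current interpreter. The helper name `missing_modules` is hypothetical and only serves this illustration.

```python
import importlib.util

# Module names taken from the list above.
REQUIRED_MODULES = ["pymongo", "tweepy", "textblob", "textblob_fr"]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are installed.")
```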

You also need to start MongoDB (through Kitematic or a Docker container here) and use the script `tweepy_save.py` to populate the database with a few tweets.

For the script to work, you need to register for Twitter's public API (https://apps.twitter.com/) so that you can fill in the following fields:

```
consumer_key = "" 
consumer_secret = ""  
access_token = ""  
access_token_secret = ""  
```
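
One way to avoid pasting secrets into the script is to read the four fields from environment variables. This is only a sketch: the `TWITTER_` prefix and the function `load_credentials` are assumptions made for this example and are not part of `tweepy_save.py`.

```python
import os

# The four field names come from the block above; the TWITTER_ prefix
# used for the environment variables is an assumption of this sketch.
FIELDS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def load_credentials(environ=os.environ):
    """Build a dict of credentials, failing early if one of them is empty."""
    credentials = {}
    for field in FIELDS:
        variable = "TWITTER_" + field.upper()
        value = environ.get(variable, "")
        if not value:
            raise KeyError("Missing environment variable: " + variable)
        credentials[field] = value
    return credentials
```

Calling `load_credentials()` once at the top of the script would then give values that can be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.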

See here for an example: https://iag.me/socialmedia/how-to-create-a-twitter-app-in-8-easy-steps/

For those who want to experiment with the Twitter API, the tweepy documentation is available here:
http://docs.tweepy.org/en/v3.5.0/getting_started.html

Please note that the code below is provided for experimentation only and is not optimized in any way!

In [54]:
from pymongo import MongoClient

In [55]:
mongo_server = "192.168.99.100"
mongo_port = 32768
client = MongoClient(mongo_server, mongo_port)

In [56]:
db = client['twitter_db']
collection = db['twitter_collection']

In [57]:
from pprint import pprint #pretty print
tweet = collection.find_one()
pprint(tweet)

{'_id': ObjectId('59e05acda56ff63cc06dffd7'),
 'contributors': None,
 'coordinates': None,
 'created_at': 'Fri Oct 13 06:18:30 +0000 2017',
 'entities': {'hashtags': [{'indices': [50, 58], 'text': 'startup'},
                           {'indices': [101, 109], 'text': 'funding'},
                           {'indices': [110, 115], 'text': 'news'}],
              'symbols': [],
              'urls': [{'display_url': 'startnexcel.com/blog/thirdwatc…',
                        'expanded_url': 'http://startnexcel.com/blog/thirdwatch-data-pvt-ltd-an-ai-powered-anti-fraud-solutions-startup-got-seed-funding-from-ian-others/',
                        'indices': [117, 140],
                        'url': 'https://t.co/WZ4324PZVf'}],
              'user_mentions': [{'id': 777811697729802241,
                                 'id_str': '777811697729802241',
                                 'indices': [0, 13],
                                 'name': 'ThirdWatch',
                                 'scr

In [58]:
tweet.keys()

dict_keys(['_id', 'created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])

In [59]:
tweet["text"]

'@thirdwatchai, an AI-powered anti-fraud solutions #startup got seed funding from @ianetwork, others\n\n#funding #news\n\nhttps://t.co/WZ4324PZVf'

In [60]:
from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer
tb = Blobber(pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())

In [61]:
from nltk.tokenize import word_tokenize
from nltk import bigrams

import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
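
The tokenizer above relies on the order of the alternatives: at each position the regex engine tries the patterns from left to right, so mentions, hashtags, and URLs are matched as whole tokens before the generic word patterns get a chance. The reduced sketch below illustrates that idea with a much shorter, simplified pattern list; `simple_tokenize` is a hypothetical name, and it will not produce exactly the same tokens as the cell above on every input.

```python
import re

# Simplified pattern list for illustration only. Order matters: the
# alternatives are tried left to right at each position in the text.
patterns = [
    r"@\w+",             # @-mentions
    r"#\w+",             # hashtags
    r"https?://\S+",     # URLs (much cruder than the pattern above)
    r"\w+(?:[-']\w+)*",  # words, optionally joined by - or '
    r"\S",               # any other non-space character
]
token_re = re.compile("|".join(patterns))


def simple_tokenize(text):
    """Split a tweet into tokens, keeping mentions, hashtags and URLs whole."""
    return token_re.findall(text)


sample = "@thirdwatchai, an AI-powered #startup https://t.co/WZ4324PZVf"
print(simple_tokenize(sample))
# ['@thirdwatchai', ',', 'an', 'AI-powered', '#startup', 'https://t.co/WZ4324PZVf']
```

If the URL pattern were listed after the word pattern, `https` would be consumed as an ordinary word and the rest of the link would be split into fragments.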

In [62]:
cursor = collection.find({'lang':'en'})
tweets = []
for document in cursor:
    tweets.append(preprocess(document["text"]))

In [63]:
len(tweets)

1769

In [70]:
tweets = [' '.join(doc) for doc in tweets]

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
#tf_transformer = TfidfTransformer().fit(tweets)
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
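
To make the `max_df` and `min_df` settings concrete, here is a plain-Python sketch of the document-frequency filter they describe: a term is kept only if it appears in at least `min_df` documents and in at most a `max_df` fraction of them. The function `filter_vocabulary` is hypothetical and splits on whitespace only; the real `TfidfVectorizer` also lowercases, applies its own token pattern, and removes stop words.

```python
from collections import Counter


def filter_vocabulary(docs, max_df=0.5, min_df=2):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs in it.
        doc_freq.update(set(doc.split()))
    return sorted(
        term
        for term, count in doc_freq.items()
        if min_df <= count <= max_df * n_docs
    )


docs = ["ai startup funding", "ai robots news", "ai startup news", "ai cloud"]
print(filter_vocabulary(docs))
# ['news', 'startup']
```

In this toy corpus `ai` is dropped because it appears in every document, while `funding`, `robots`, and `cloud` are dropped because they appear only once.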

In [71]:
vectorizer.fit_transform(tweets)

<1769x1989 sparse matrix of type '<class 'numpy.float64'>'
	with 10467 stored elements in Compressed Sparse Row format>

In [74]:
X = vectorizer.fit_transform(tweets)

In [75]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

svd = TruncatedSVD(50)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X = lsa.fit_transform(X)
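
A likely reason for the `Normalizer` step is that, once every vector has unit length, the squared Euclidean distance used by k-means is a simple function of cosine similarity: for unit vectors `a` and `b`, `||a - b||^2 = 2 - 2 * cos(a, b)`. The plain-Python sketch below checks this identity on two small vectors; the helper names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])
# Both sides of the identity agree (up to floating-point rounding).
print(squared_distance(a, b), 2 - 2 * cosine_similarity(a, b))
```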

In [76]:
from sklearn.cluster import KMeans, MiniBatchKMeans
k = 10
km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1,
                verbose=False)

In [77]:
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=10, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=False)

In [78]:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

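# Note: scikit-learn 1.0 deprecated get_feature_names() in favor of
# get_feature_names_out(), and the old name was removed in 1.2.
terms = vectorizer.get_feature_names()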

for i in range(k):
    print(i)
    for ind in order_centroids[i, :10]:
        print(terms[ind])

0
amp
tech
robotics
solutions
release
art
alliance
blue_prism
form
clients
1
mailonline
weinstein
harvey
calls
emma
thompson
predator
delhi
police
sex
2
intelligence
artificial
human
future
time
work
saving
technology
doctors
patients
3
know
world
tech
need
data
business
don
big
china
fintech
4
use
learning
using
technology
microsoft
amazon
google
news
cloud
twitter
5
iot
bigdata
machinelearning
ml
deeplearning
datascience
dl
artificialintelligence
vr
fintech
6
love
lol
make
really
machinelearning
mailonline
chatbot
youtube
business
stop
7
like
video
youtube
don
launches
mailonline
liked
good
facebook
analysis
8
just
united
gt
manchester
liverpool
mailsport
old
brexit
mailonline
mumbai
9
new
google
god
china
apple
face
box
robots
post
needed
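
The listing above comes from sorting each centroid's term weights in descending order and printing the first ten terms. The same ranking step can be sketched in plain Python on a toy centroid; `top_terms` is a hypothetical helper, and the weights below are made up for the illustration.

```python
def top_terms(centroid, terms, n=10):
    """Return the n terms with the largest weights in a centroid vector."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]


# Toy data: four vocabulary entries and one invented centroid.
toy_terms = ["amp", "tech", "art", "form"]
toy_centroid = [0.1, 0.7, 0.0, 0.4]
print(top_terms(toy_centroid, toy_terms, n=2))
# ['tech', 'form']
```

Terms with equal weights may come out in a different order than with `argsort()[:, ::-1]`, since the reversal there also reverses the order of ties.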
