## Lab 6 - Analyzing Reviews on Real Estate Services "CIAN" and "Yandex.Realty"
We have user reviews of Cian.ru and Realty.Yandex.ru that were left on Yandex. These reviews are public and can be viewed in Yandex.Browser:
<img src = 'reviews_data/reviews_yandex_browser.png'>



### Main objectives
After successfully completing the lab, students will be able to:
- Use review datasets to automatically analyze how customers view competing products. This technique is useful for both product managers and product analysts.


### Tasks
- Create a classifier that distinguishes between reviews of the 2 competitors, in order to find the words and bigrams in customer reviews that differentiate them most
- Find out how customers' view of the product has changed over time



In [None]:
# import the pandas library and set display options so that data can be viewed right in the notebook
# pd is the conventional short alias for pandas
import pandas as pd

# useful options to display more data from dataframes right in the notebook
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_rows', 500)

# in order to display plots right in the notebook:
from matplotlib import pyplot as plt
%matplotlib inline

# libraries for working with text data
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
import nltk

In [None]:
REALTY_REVIEWS_PATH = 'reviews_data/realty.reviews.tsv'
CIAN_REVIEWS_PATH = 'reviews_data/cian.reviews.tsv'

In [None]:
cian_reviews = pd.read_csv(CIAN_REVIEWS_PATH, sep = '\t')
realty_reviews = pd.read_csv(REALTY_REVIEWS_PATH, sep = '\t')

In [None]:
# let's investigate the data
# HINT for non-Russian-speaking students: please use https://translate.yandex.com/ or ask colleagues from your team to 
# help you understand what people write in the reviews
cian_reviews.head()

In [None]:
realty_reviews.head()

In [None]:
# let's check that the data contains no hosts other than cian.ru and realty.yandex.ru
cian_reviews.host.value_counts()

In [None]:
realty_reviews.host.value_counts()

### Let's look at the most popular words in CIAN and Yandex.Realty reviews

In [None]:
# get stopwords for russian language
nltk.download("stopwords")
russian_stopwords = stopwords.words("russian")

In [None]:
# let's look at which words are in this list and check whether it makes sense
# take the first 10 elements of the list
russian_stopwords[0:10]

In [None]:
# create the WordCloud using russian stopwords to see which words are used in reviews:
wordcloud = WordCloud(background_color='black', stopwords = russian_stopwords,
                max_words = 200, max_font_size = 100, 
                random_state = 17, width=800, height=400)

In [None]:
list(realty_reviews.loc[:, 'text'])

In [None]:
plt.figure(figsize=(16, 12))
# take all rows from 'text' column and generate WordCloud
all_texts = list(realty_reviews.loc[:, 'text'])
wordcloud.generate(" ".join(all_texts))
plt.imshow(wordcloud);

In [None]:
all_5_star_texts = list(realty_reviews.loc[realty_reviews['rating'] == 5, 'text'])
wordcloud.generate(" ".join(all_5_star_texts))
plt.imshow(wordcloud);

In [None]:
# let's create a function that generates a wordcloud from reviews with a given rating:
def gen_wordcloud(df, rating = None, text_column_name = 'text'):
    if rating is None:
        all_texts = list(df.loc[:, text_column_name])
    else:
        all_texts = list(df.loc[df['rating'] == rating, text_column_name])
    wordcloud.generate(" ".join(all_texts))
    plt.imshow(wordcloud)

In [None]:
gen_wordcloud(realty_reviews, rating = 4)

In [None]:
gen_wordcloud(realty_reviews, rating = 3)

In [None]:
gen_wordcloud(realty_reviews, rating = 2)

In [None]:
gen_wordcloud(realty_reviews, rating = 1)

### Self-control stops
- Generate a WordCloud for all CIAN reviews and for CIAN reviews with each rating from 5 to 1. How do the most popular words differ from those in Yandex.Realty reviews?

### Let's analyze the most popular n-grams from reviews
An n-gram is a sequence of n consecutive words from a text.
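
To make the definition concrete, here is a minimal pure-Python sketch. The helper `ngrams` and the toy token list are hypothetical and only illustrate the idea; the notebook itself relies on `nltk.bigrams` and `nltk.trigrams`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy review, already split into tokens
tokens = ["easy", "search", "for", "flats"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with fewer than n tokens produces no n-grams at all, so very short reviews contribute nothing to the bigram and trigram counts.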

#### 1st step: get a lemmatized list of tokens from the text. It's important to count different forms of the same word as one word

In [None]:
from pymystem3 import Mystem
from string import punctuation

In [None]:
#Create lemmatizer
mystem = Mystem() 

#Preprocess function
def preprocess_text(text):
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    
    text = " ".join(tokens)
    
    return text

#Examples    
preprocess_text("Ну что сказать, я вижу кто-то наступил на грабли, Ты разочаровал меня, ты был натравлен.")


In [None]:
# preprocess all texts and save them in separate column
realty_reviews['preprocessed_text'] = realty_reviews['text'].map(preprocess_text)

In [None]:
# see sample of preprocessing result:
realty_reviews['preprocessed_text'].sample(5)

#### 2nd step: get most popular n-grams from reviews with different rating

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [None]:
import collections

In [None]:
def get_top_n_grams(df_reviews, 
                    rating = 5, 
                    ngrams = 2, 
                    rating_col_name = 'rating', 
                    text_col_name = 'preprocessed_text'):
    '''
    returns n-gram counts for reviews with the given rating, sorted from most to least frequent
    supports 1-, 2- and 3-grams; any other value of ngrams gives an empty dict
    '''
    counts = collections.Counter()
    for review in df_reviews[df_reviews[rating_col_name] == rating][text_col_name]:
        words = nltk.word_tokenize(review)
        if ngrams == 2:
            counts.update(nltk.bigrams(words))
        elif ngrams == 1:
            counts.update(words)
        elif ngrams == 3:
            counts.update(nltk.trigrams(words))
    # most_common() orders the pairs by count, descending; dict keeps that order
    sorted_counts = dict(counts.most_common())
    return sorted_counts
    

In [None]:
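# word_tokenize needs the Punkt tokenizer data; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")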
get_top_n_grams(df_reviews=realty_reviews, rating = 5, ngrams=3)

### Self-control stops
- Analyze the top 2-grams and 3-grams for both CIAN and Yandex.Realty in 5-star and 1-star reviews. What do people commonly say about both services, and what are the differences?
- What are the mean and median ratings for CIAN and Yandex.Realty reviews? How do the mean and median ratings change if we look only at recent reviews from the last year, starting from 2019-10-10?
Hint: to filter by date, you can compare against a date in yyyy-mm-dd format, like this: ``realty_reviews[realty_reviews.day >= '2019-10-01']``
- Some authors have used both services and reviewed each of them. What are the mean and median ratings for CIAN and Yandex.Realty reviews written by the same authors? 
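
The date hint above works because zero-padded yyyy-mm-dd strings sort alphabetically in the same order as the dates they represent. The sketch below shows this in plain Python on made-up rows; it assumes the `day` column holds such strings, which is worth checking in the real data.

```python
from statistics import mean, median

# Hypothetical toy rows (day, rating); the real data comes from the TSV files
reviews = [
    ("2018-03-15", 2),
    ("2019-09-30", 1),
    ("2019-10-01", 5),
    ("2019-12-24", 4),
    ("2020-02-02", 5),
]

cutoff = "2019-10-01"

# Plain string comparison: valid only for zero-padded yyyy-mm-dd values
recent = [rating for day, rating in reviews if day >= cutoff]
all_ratings = [rating for _, rating in reviews]

print("all:   ", mean(all_ratings), median(all_ratings))
print("recent:", mean(recent), median(recent))
```

If the dates were not zero-padded, the comparison would silently give wrong results: `"2019-9-30" >= "2019-10-01"` is true as a string comparison even though the date is earlier.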

### Try to build a classifier for CIAN/Yandex.Realty reviews and analyze the factors
#### prepare the dataset

In [None]:
cian_reviews['preprocessed_text'] = cian_reviews['text'].map(preprocess_text)

In [None]:
df_all = pd.concat([cian_reviews, realty_reviews])

In [None]:
len(df_all)

In [None]:
list(df_all)

In [None]:
# make a label for the classifier
df_all['label_yandex'] = df_all['host'].map({'cian.ru': 0, 'realty.yandex.ru': 1})

In [None]:
df_all.sample(5)

#### prepare the training and testing datasets and train the model to classify whether a review is about Yandex.Realty

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

In [None]:
# strip_accents='unicode' maps 'й' to 'и', so 'сайт' must be listed as 'саит'
stopwords_classifier = ['яндекс', 'сервис', 'саит']
def train_text_classfier(df, label = 'label_yandex'):
    '''
    df must contain a 'preprocessed_text' column and the column named by `label`
    '''
    xtrain, xvalid, ytrain, yvalid = train_test_split(df.preprocessed_text.values, df[label].values, 
                                                  stratify=df[label].values, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)
    tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words=stopwords_classifier)

    # Fit TF-IDF on both the training and validation texts; only the texts are used, not the labels
    tfv.fit(list(xtrain) + list(xvalid))
    xtrain_tfv = tfv.transform(xtrain)
    xvalid_tfv = tfv.transform(xvalid)
    
    # Fitting a simple Logistic Regression on TFIDF
    clf = LogisticRegression(C=1.0)
    clf.fit(xtrain_tfv, ytrain)

    predictions = clf.predict(xvalid_tfv)
                          
    return clf, tfv, yvalid, predictions 

In [None]:
clf, tfv, yvalid, predictions = train_text_classfier(df_all)

In [None]:
confusion_matrix(yvalid, predictions)

In [None]:
roc_auc_score(yvalid, predictions)

In [None]:
accuracy_score(yvalid, predictions)
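
One caveat about the ROC AUC call above: `clf.predict` returns hard 0/1 labels, whereas `roc_auc_score` is normally given scores such as `clf.predict_proba(...)[:, 1]`. The sketch below uses a small hand-written AUC (the share of positive/negative pairs ranked correctly, with ties counted as half) on made-up numbers to show how much thresholding can change the result. The helper `roc_auc` and all values are hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative example counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1, 0]
proba = [0.10, 0.40, 0.45, 0.80, 0.90, 0.30]   # hypothetical P(label = 1)
labels = [int(p >= 0.5) for p in proba]        # what predict() would return

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)

print(auc_from_proba, auc_from_labels)
```

Here the probabilities rank every positive example above every negative one, but the hard labels hide that, because the positive example scored 0.45 becomes indistinguishable from the negatives.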

#### analyze which factors in the text were the most important for deciding whether a review was written about Yandex

In [None]:
import eli5
eli5.show_weights(estimator=clf,
                 vec = tfv, top=50)

### Self-control stops
- Use the top factors to decide which words should be added to the stopwords list. Add them and rerun the classifier. What are the top words that differentiate Yandex.Realty from CIAN? Find some reviews containing them to understand the context.
- Make the same analysis for positive reviews with ratings of 4 and 5, and for negative reviews, with ratings of 1 and 2. What are the top factors, which differentiate Yandex.Realty from CIAN for positive and negative reviews?