# Part 4: Build with Feature Extraction and Logistic Regression, and Predict on Foursquare Data


Part 4: Build + Predict

BUILD: Create a data model
- Selecting the appropriate model
- Building a model
- Training and testing our model
- Evaluating and refining our model



PREDICT: Predict on new data from the Foursquare API
- Loading the cleaned data from a CSV file
- Using the trained model to create predictions
- Creating visualisations to answer my question

In [None]:
import pandas as pd
import numpy as np

### Loading the Training Dataset

Training:
- polarity : Target 2:positive, 1:neutral, 0:negative
- words : preprocessed sentences
- type : the tags of the words from lemmatizing 

In [None]:
training = pd.read_csv('./train_test_data/training_bs.csv', encoding='utf8')
training.head()

### Check for null values

In [None]:
print training.isnull().sum()

### Set up X and y

In [None]:
X_train = training['lem_words']
y_train = training['sentiment']

print X_train.shape
print y_train.shape

### Baseline Accuracy
- The baseline accuracy is the proportion of the majority class. Here the majority class is '2' (positive sentiment), so the baseline accuracy is approximately 0.3.

baseline_accuracy = majority class N / total N
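
As a small illustration of the formula above, the sketch below computes the baseline on a made-up label vector (`toy_labels` is hypothetical, not the training data) and checks it against scikit-learn's `DummyClassifier`, which always predicts the majority class.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# hypothetical labels: 2 = positive, 1 = neutral, 0 = negative
toy_labels = pd.Series([2, 2, 2, 1, 1, 0])

# baseline accuracy = majority class N / total N
toy_baseline = toy_labels.value_counts(normalize=True).max()

# a classifier that always predicts the majority class scores exactly the baseline;
# the features are irrelevant here, so a column of zeros is used as a placeholder
placeholder_X = np.zeros((len(toy_labels), 1))
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(placeholder_X, toy_labels)
dummy_accuracy = dummy.score(placeholder_X, toy_labels)
```

Any real model should beat this number before its accuracy is considered meaningful.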


In [None]:
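class_proportions = y_train.value_counts(normalize=True)
print class_proportions
# baseline = proportion of the majority class (about 0.3 here)
baseline = class_proportions.max()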

## Comparing Count Vec and TFIDF
- Count Vectorizer 
- TFIDF Vectorizer 
- Logistic Regression is the classifier used with both vectorizers, as it is a fast classifier to train

In [None]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score

lr = LogisticRegression(random_state=1)

##### Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
# initialise the vectorizer
cvec = CountVectorizer()
# fit the vectorizer on the training data to learn the vocabulary
cvec.fit(X_train)

# transform the training data into a sparse matrix
X_train_cvec = cvec.transform(X_train)

# cross-validated accuracy (3 folds)
cvec_score = cross_val_score(lr, X_train_cvec, y_train, cv=3)


##### TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
# initialise the vectorizer
tvec = TfidfVectorizer()
# fit the vectorizer on the training data to learn the vocabulary and IDF weights
tvec.fit(X_train)

# transform the training data into a sparse matrix
X_train_tvec = tvec.transform(X_train)

# cross-validated accuracy; same number of folds as above so the scores are comparable
tvec_score = cross_val_score(lr, X_train_tvec, y_train, cv=3)

### Comparing scores
- Both vectorizers increased the accuracy above the baseline
- The Count Vectorizer gave the highest score
- Hyperparameters will be tuned further for the Count Vectorizer
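
One caveat with this comparison: each vectorizer is fitted on all of the training data before cross-validation, so the vocabulary (and, for TF-IDF, the IDF weights) has already seen every validation fold. A hedged sketch of an alternative is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is refitted inside each fold. The `toy_docs` and `toy_y` data and the helper `pipeline_cv_score` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# hypothetical toy corpus: 2 = positive, 1 = neutral, 0 = negative
toy_docs = ['great food', 'great service', 'lovely great place',
            'okay food', 'okay service', 'okay average place',
            'awful food', 'awful service', 'awful rude place']
toy_y = [2, 2, 2, 1, 1, 1, 0, 0, 0]


def pipeline_cv_score(vectorizer, docs, y, cv=3):
    """Mean cross-validated accuracy with the vectorizer refitted in each fold."""
    pipe = Pipeline([('vec', vectorizer),
                     ('lr', LogisticRegression(random_state=1))])
    return cross_val_score(pipe, docs, y, cv=cv).mean()


toy_scores = {
    'cvec': pipeline_cv_score(CountVectorizer(), toy_docs, toy_y),
    'tvec': pipeline_cv_score(TfidfVectorizer(), toy_docs, toy_y),
}
```

On the real data, `X_train` and `y_train` would take the place of the toy lists. The scores may come out slightly lower than with a pre-fitted vectorizer, since no fold is seen in advance.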


In [None]:
print 'Baseline:', baseline
print 'Tfidf Vectorizer Score:', tvec_score.mean()
print 'Count Vectorizer Score:', cvec_score.mean()
acc_list = []
acc_list.append(cvec_score.mean())
acc_list.append(tvec_score.mean())

# DataFrame Accuracy 
acc_df = pd.DataFrame()
acc_df['params']= ['cvec', 'tvec']
acc_df['scores']= acc_list
acc_df

# EDA of Count Vectorizer
We still have the same number of rows, but the vectorization has converted every word (or whatever the tokenizer treats as a word) in our training data into a feature. This is like dummy-coded variables for words, except that we have counts rather than just occurrences. The feature names are the words themselves, so summing each column gives the frequency of each word across the corpus.

#### Word Frequency 
A matrix showing the 10 most common words and how many times they appear 
- Feature matrix of word occurrences 
- Top 10 most frequently occurring words
- Most important words
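
To make the document-term matrix concrete, here is a minimal sketch on a made-up two-document corpus (`toy_docs` is hypothetical). It also sums the counts directly on the sparse matrix, which avoids building a dense copy when the vocabulary is large.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus
toy_docs = ['good food good service', 'bad food']

toy_cvec = CountVectorizer()
toy_matrix = toy_cvec.fit_transform(toy_docs)  # sparse, shape (2 documents, 4 words)

# newer scikit-learn releases use get_feature_names_out(); older ones get_feature_names()
if hasattr(toy_cvec, 'get_feature_names_out'):
    toy_words = list(toy_cvec.get_feature_names_out())
else:
    toy_words = list(toy_cvec.get_feature_names())

# one row per document, one column per word, each cell a count
toy_counts = pd.DataFrame(toy_matrix.toarray(), columns=toy_words)

# total count per word, summed on the sparse matrix without densifying it
toy_totals = pd.Series(np.asarray(toy_matrix.sum(axis=0)).ravel(), index=toy_words)
toy_top = toy_totals.sort_values(ascending=False)
```

Here "good" gets a count of 2 in the first document rather than a 0/1 flag, which is the difference from dummy coding.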

In [None]:
df_cvec = pd.DataFrame(X_train_cvec.todense(),columns=cvec.get_feature_names())
word_freq = df_cvec.sum(axis=0).sort_values(ascending=False)[:10]
word_freq

#### COUNT VEC - Zipf's law
Zipf's law states that, in a corpus of text, any word's frequency is inversely proportional to its rank in the frequency table.

Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation.
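
A quick way to check this relation is to multiply each word's rank by its frequency (roughly constant under Zipf's law) or to fit a line to log-frequency against log-rank (slope close to -1). The sketch below uses hypothetical counts constructed to follow the law exactly, so it shows the check itself rather than a result from this corpus.

```python
import numpy as np
import pandas as pd

# hypothetical word counts that follow Zipf's law exactly: frequency = 1200 / rank
toy_freq = pd.Series({'the': 1200, 'of': 600, 'and': 400, 'to': 300, 'in': 240})

# rank words from most to least frequent (1 = most frequent)
toy_sorted = toy_freq.sort_values(ascending=False)
ranks = np.arange(1, len(toy_sorted) + 1)

# under Zipf's law, rank * frequency is roughly constant
rank_times_freq = ranks * toy_sorted.values

# equivalently, log(frequency) against log(rank) is a straight line with slope near -1
slope, intercept = np.polyfit(np.log(ranks), np.log(toy_sorted.values), 1)
```

On the real data, `toy_freq` could be replaced by the per-word totals from the Count Vectorizer matrix; a fitted slope far from -1 would suggest the corpus departs from the law.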




Plotting the word frequency distribution lets us check this against our own corpus. Reference: https://en.wikipedia.org/wiki/Zipf%27s_law 

In [None]:
import matplotlib.pyplot as plt
# histogram over every word in the vocabulary (word_freq holds only the top 10)
df_cvec.sum(axis=0).plot(kind='hist',
            title='Number of words with a given number of appearances',
            fontsize=14)
plt.show()

TODO: compare how using more or fewer `max_features` changes the score, and write up the result.

# Tuning Hyperparameters Count Vectorizer 


In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

### N_gram

In [None]:
def count_vec_ngram(params, X_train, y_train):
    cvec_p = CountVectorizer(ngram_range=(params)) 

    cvec_p.fit(X_train)
    X_train_cvec_p = cvec_p.transform(X_train)

    # cross val score/ predict
    cvec_score_p = cross_val_score(lr, X_train_cvec_p, y_train, cv=3)

    # cross validation 
    return cvec_score_p.mean()

params = [(1,1), (1,2),(1,3), (1,4)] 
ngram_scores = []
for p in params:
    ngram_scores.append(count_vec_ngram(p, X_train, y_train))
    
ngrams = ['cvec gram_1','cvec gram_2','cvec gram_3','cvec gram_4']
ngram_df = pd.DataFrame({'params':ngrams, 'scores':ngram_scores}, index=[0,1,2,3])
# adding cvec score with default params
ngram_df = ngram_df.append(acc_df.iloc[:1,:])

# plot scores on graph
sns.pointplot(x='params', y='scores', data =ngram_df)
plt.ylabel('Accuracy Score')
plt.xlabel('ngrams')
plt.show()

In [None]:
# update accuracy scores with the (1,2) n-gram score, which is row 1 of ngram_df
acc_df1 = acc_df.append(ngram_df.iloc[1,:])
acc_df1.reset_index(inplace=True, drop=True)
acc_df1

### Max Features

In [None]:
def count_vec_max_features(params, X_train, y_train):
    cvec_p = CountVectorizer(max_features=params) 

    cvec_p.fit(X_train)
    X_train_cvec_p = cvec_p.transform(X_train)

    # cross val score/ predict
    cvec_score_p = cross_val_score(lr, X_train_cvec_p, y_train, cv=3)

    # cross validation 
    return cvec_score_p.mean()

mf_params = [None, 500, 1000, 5000, 10000]
max_features_scores = [count_vec_max_features(p, X_train, y_train) for p in mf_params]
max_features = ['max_f_'+str(p) for p in mf_params]

# dataframe for scores
max_features_df = pd.DataFrame({'params':max_features, 'scores':max_features_scores}, index=[0,1,2,3,4])
# adding cvec score with default params
max_features_df = max_features_df.append(acc_df.iloc[:1,:])

sns.pointplot(x='params', y='scores', data =max_features_df)
plt.ylabel('Accuracy Score')
plt.xlabel('max_features')
plt.show()

In [None]:
# update accuracy dataframe with 3 highest scores
acc_df2 = acc_df1.append(max_features_df.drop(max_features_df.index[[1,2]]))
acc_df2.reset_index(inplace=True, drop=True)
acc_df2

### max_df

In [None]:
def count_vec_max_df(params, X_train, y_train):
    cvec_p = CountVectorizer(max_df=params) 

    cvec_p.fit(X_train)
    X_train_cvec_p = cvec_p.transform(X_train)

    # cross val score/ predict
    cvec_score_p = cross_val_score(lr, X_train_cvec_p, y_train, cv=3)

    # cross validation 
    return cvec_score_p.mean()

mdf_params = [0.25, 0.5, 0.75, 1.0]
max_df_scores = [count_vec_max_df(p, X_train, y_train) for p in mdf_params]
max_df = ['max_df_'+str(p) for p in mdf_params]

# dataframe for scores
max_df_df = pd.DataFrame({'params':max_df, 'scores':max_df_scores}, index=[0,1,2,3])
# adding cvec score with default params
max_df_df = max_df_df.append(acc_df.iloc[:1,:])

sns.pointplot(x='params', y='scores', data =max_df_df)
plt.ylabel('Accuracy Score')
plt.xlabel('max_df')
plt.show()

In [None]:
# update accuracy dataframe
acc_df3 = acc_df2.append(max_df_df.iloc[:2,:])
acc_df3.reset_index(inplace=True, drop=True)
acc_df3

# Perform Count Vectorizer with chosen parameters
- ngram_range = (1,2): the bigram range had the highest score
- max_features: will not be used, as the highest scores were the same as with the default params
- max_df = 0.25: both 0.25 and 0.5 gave the same score. The general trend for this param is that the lower the threshold, the higher the score, so of 0.5 and 0.25 the lower value is chosen

In [None]:
##### Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
# initialise the vectorizer with the chosen parameters
cvec = CountVectorizer(ngram_range=(1,2), max_df=0.25)
# fit the vectorizer on the training data
cvec.fit(X_train)

# transform the training data into a sparse matrix
X_train_cvec = cvec.transform(X_train)

# cross-validated accuracy
cvec_score = cross_val_score(lr, X_train_cvec, y_train, cv=3)
cvec_score.mean()


# add the combined-parameter score as a new row instead of overwriting an existing one
acc_df3.loc[len(acc_df3)] = ['best_params', cvec_score.mean()]
acc_df3.sort_values('scores', ascending=False)


# Highest Score of Feature Transform Optimization: Bigram
- As shown above, the combined best parameters gave a lower score than the bigram range with default params. 
- The bigram range is therefore kept as a fixed param while the Logistic Regression is optimized 
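
One possible reason the combined parameters scored lower is that each parameter was tuned on its own, and the individually best values do not necessarily work best together. A hedged sketch of searching the vectorizer parameters jointly, using `GridSearchCV` over a `Pipeline`, is shown below. The toy data and the grid values are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# hypothetical toy corpus: 2 = positive, 1 = neutral, 0 = negative
toy_docs = ['great food', 'great service', 'lovely great place',
            'okay food', 'okay service', 'okay average place',
            'awful food', 'awful service', 'awful rude place']
toy_y = [2, 2, 2, 1, 1, 1, 0, 0, 0]

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('lr', LogisticRegression(random_state=1))])

# every combination of these values is cross-validated together;
# the 'cvec__' prefix routes each parameter to the pipeline step named 'cvec'
param_grid = {'cvec__ngram_range': [(1, 1), (1, 2)],
              'cvec__max_df': [0.5, 1.0]}

joint_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
joint_search.fit(toy_docs, toy_y)

best_vectorizer_params = joint_search.best_params_
best_joint_score = joint_search.best_score_
```

The cost is that the number of fits grows with the product of the grid sizes, so the grid is best kept small.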

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# Transform features once!
cvec_p = CountVectorizer(ngram_range=(1,2))  # bigram range chosen above
cvec_p.fit(X_train)
X_train_cvec_p = cvec_p.transform(X_train)


In [None]:
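# L1-penalised logistic regression; C is chosen by 5-fold cross-validation
model_l1 = LogisticRegressionCV(Cs=np.logspace(-10,10,21), penalty='l1', solver='liblinear', cv=5)
# fit on the transformed training matrix (X and y are not defined at this point)
model_l1.fit(X_train_cvec_p, y_train)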

## Gridsearch on Logistic Regression

In [None]:
# Transform features once!
cvec_p = CountVectorizer(ngram_range=(1,2))  # bigram range chosen above
cvec_p.fit(X_train)
X_train_cvec_p = cvec_p.transform(X_train)


In [None]:
lr_params = {'penalty': ['l1','l2'],
          'solver':['liblinear'],
          'C': np.logspace(-10,10,21)}

# we define the gridsearchCV
lr_grid = GridSearchCV(lr, param_grid=lr_params, cv=3, n_jobs=-1, verbose=1)

# fit with the transformed sparse matrix
lr_grid.fit(X_train_cvec_p, y_train)

print 'Best Score:', lr_grid.best_score_
print
# assign the best estimator to a variable:
best_lr = lr_grid.best_estimator_
print 'Best Params:', lr_grid.best_params_

## Logistic Regression CV 

LogisticRegressionCV is tried here as an alternative to the grid search above; it cross-validates over the supplied `Cs` values internally.

In [None]:
from sklearn.linear_model import LogisticRegressionCV
lr_cv = LogisticRegressionCV(Cs=np.logspace(-10,10,21), penalty='l1', solver='liblinear', cv=5)
# fit the model defined on the line above, using the transformed training matrix
lr_cv.fit(X_train_cvec_p, y_train)

# Evaluate on Test set

In [None]:
testing = pd.read_csv('./train_test_data/testing.csv')
testing.head(3)

In [None]:
X_test = testing.lem_words
y_test = testing['sentiment']

In [None]:
# transform with cvec and predict on best log reg
X_test_mat = cvec_p.transform(X_test)
y_pred = best_lr.predict(X_test_mat)

print 'Best CVal on training:', lr_grid.best_score_
print 'Best Model on testing:', accuracy_score(y_test, y_pred)
# count mismatched labels; summing |y_pred - y_test| would count a 0-vs-2 error twice
print "Number of classification errors:", (y_pred != y_test).sum()
print 'Total:', len(y_test)

In [None]:
cvec_p.get_feature_names()

- The model had a high accuracy score of 0.94 on the training data
- On the test set the accuracy dropped to 0.67
- This is still well above the baseline of 0.3
- However, the gap shows that the model is overfitting the training data


# Confusion matrix 
- I will evaluate my model by showing how accurate it is for each class

In [None]:
from sklearn.metrics import confusion_matrix
import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test,y_pred)
confusion = pd.DataFrame(np.array(cm), index=['True_Negative', 'True_Neutral', 'True_Positive'], 
                         columns =['Pred_Negative', 'Pred_Neutral', 'Pred_Positive'])
confusion

In [None]:
x_class = ['Negative', 'Neutral', 'Positive']
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plot_confusion_matrix(cm, classes=x_class, title='Confusion matrix, without normalization')
plt.show()

In [None]:
# Plot normalized confusion matrix
plot_confusion_matrix(cm, classes=x_class, normalize=True, title='Normalized confusion matrix')
plt.show()

http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
https://medium.com/tech-vision/introduction-to-confusion-matrix-classification-modeling-54d867169906

https://www.youtube.com/watch?v=FAr2GmWNbT0 
- TP, TN, FP and FN: true positives, true negatives, false positives and false negatives

- Accuracy is the ratio of true positives plus true negatives to the whole population (the proportion of correctly predicted values) 

- These models all have relatively high scores; however, accuracy may not be the best metric because the classes are highly unbalanced. For instance, if we were predicting spam with 100 spam emails and 900 non-spam emails, a model that labels every email as non-spam would be 90% accurate while detecting none of the spam.

- To improve the model on the smaller classes, I would suggest resampling with replacement (bootstrapping) to balance the classes. For example, spam emails could be upsampled to 500 and non-spam emails downsampled to 500. 
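
The spam example can be worked through in a few lines. This is a minimal sketch on made-up data (the `emails` frame is hypothetical), using `sklearn.utils.resample` as one possible way to do the up- and downsampling.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score
from sklearn.utils import resample

# hypothetical data: 100 spam (1) and 900 non-spam (0) emails
emails = pd.DataFrame({'is_spam': [1] * 100 + [0] * 900})

# a "model" that labels every email as non-spam
always_non_spam = np.zeros(len(emails), dtype=int)
naive_accuracy = accuracy_score(emails['is_spam'], always_non_spam)  # 900 correct out of 1000
spam_recall = recall_score(emails['is_spam'], always_non_spam)       # none of the 100 spam found

# rebalance: upsample spam with replacement, downsample non-spam without replacement
spam = emails[emails['is_spam'] == 1]
non_spam = emails[emails['is_spam'] == 0]
spam_up = resample(spam, replace=True, n_samples=500, random_state=1)
non_spam_down = resample(non_spam, replace=False, n_samples=500, random_state=1)
balanced = pd.concat([spam_up, non_spam_down])
balanced_counts = balanced['is_spam'].value_counts()
```

Resampling should be applied to the training split only, so that duplicated rows do not end up in both the training and the test data.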

# Classification Report

In [None]:
from sklearn.metrics import classification_report
print classification_report(y_test,y_pred)

### Predicting Probabilities for Classes and ROC Curve

In [None]:
Y_pp_mat =best_lr.predict_proba(X_test_mat)

print 'predicted probabilities for each class:' 
# Get the predicted probability vector and explicitly name the columns:
Y_pp = pd.DataFrame(Y_pp_mat, columns=['class_0_pp','class_1_pp', 'class_2_pp'])
# Y_pp['pred_class_thresh10'] = [1 if x >= 0.10 else 0 for x in Y_pp.class_1_pp.values]
Y_pp.head()


In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from scipy import interp

In [None]:
# Binarize the output
y_test_b = label_binarize(y_test, classes=[0, 1, 2])
n_classes_test = y_test_b.shape[1]

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes_test):
    fpr[i], tpr[i], _ = roc_curve(y_test_b[:, i], Y_pp_mat[:,i])
    roc_auc[i] = auc(fpr[i], tpr[i])


x_class = ['Negative', 'Neutral', 'Positive']
# Plot the ROC curve for each class
plt.figure(figsize=[6,6])
for i in range(n_classes_test):
    plt.plot(fpr[i], tpr[i], label=str(x_class[i]) + ' (area = %0.2f)' % roc_auc[i], linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('ROC Curve for Sentiment Classification', fontsize=18)
plt.legend(loc="lower right")
plt.show()


In [None]:
y_test_b.shape # check for variable in cell above!

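# Threshold

In [None]:
# ROC points for a single class; class 2 (positive) is chosen here for illustration
fpr_pos, tpr_pos, threshold_pos = roc_curve(y_test_b[:, 2], Y_pp_mat[:, 2])

print 'fpr\t', 'tpr\t', 'threshold'
print np.array(zip(fpr_pos, tpr_pos, threshold_pos))
# recall remains high even as the threshold value changes
# the area under the curve cannot be ...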
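
To make the relationship between thresholds, false positive rate and true positive rate concrete, here is a minimal sketch with four made-up predictions (`toy_true` and `toy_prob` are hypothetical). Each threshold returned by `roc_curve` corresponds to one point on the curve.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# hypothetical true labels (1 = the class of interest) and predicted probabilities
toy_true = np.array([0, 0, 1, 1])
toy_prob = np.array([0.1, 0.4, 0.35, 0.8])

# each threshold t gives one (fpr, tpr) point: predict 1 whenever probability >= t
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_prob)
toy_auc = auc(toy_fpr, toy_tpr)

# tabulate the points, skipping the first threshold, which is an artificial
# "nothing is predicted positive" starting point added by roc_curve
roc_points = list(zip(toy_fpr[1:], toy_tpr[1:], toy_thresholds[1:]))
```

Lowering the threshold moves along the curve towards the top right: more true positives are caught, at the price of more false positives.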

# Features importances

In [None]:
# check against cvec_p.get_feature_names(): which words affected the model most?
# coef_ has one row per class; row 0 holds the coefficients for class 0 (negative)
best_lr.coef_[0]

In [None]:
coef_df = pd.DataFrame({
        'coef':best_lr.coef_[0]})
coef_df['abs_coef'] = np.abs(coef_df.coef)
coef_df['feature_names'] = cvec_p.get_feature_names()
# sort by absolute value of coefficient (magnitude)
# coef_df.sort_values('abs_coef', ascending=False, inplace=True)

coef_df.head()

In [None]:
# Show non-zero coefs and predictors
# coef_df[coef_df.coef != 0]
len(coef_df[coef_df.coef != 0])
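
Since `coef_` holds one row of coefficients per class, the most influential word can be looked up for each sentiment rather than for class 0 only. A minimal sketch on a made-up corpus (all `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical toy corpus: 2 = positive, 1 = neutral, 0 = negative
toy_docs = ['great food', 'great service', 'okay food',
            'okay service', 'awful food', 'awful service']
toy_y = [2, 2, 1, 1, 0, 0]

toy_cvec = CountVectorizer()
toy_X = toy_cvec.fit_transform(toy_docs)
toy_lr = LogisticRegression(random_state=1).fit(toy_X, toy_y)

# newer scikit-learn releases use get_feature_names_out(); older ones get_feature_names()
if hasattr(toy_cvec, 'get_feature_names_out'):
    toy_words = list(toy_cvec.get_feature_names_out())
else:
    toy_words = list(toy_cvec.get_feature_names())

# rows of coef_ follow the order of classes_; take the largest coefficient in each row
top_word_per_class = {}
for class_label, class_coefs in zip(toy_lr.classes_, toy_lr.coef_):
    top_word_per_class[int(class_label)] = toy_words[int(np.argmax(class_coefs))]
```

Sorting each row instead of taking only the maximum would give a ranked list of words per class.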

In [None]:
# are the top words similar for TF-IDF?
# tvec_df  = pd.DataFrame(tvec.transform(training.words).todense(),
#                    columns=tvec.get_feature_names())
# tvec_df.sum(axis=0).sort_values(ascending=False)[:10]

## Foursquare data
- polarity : Target 2:positive, 1:neutral, 0:negative
- words : preprocessed sentences
- type : the tags of the words from lemmatizing 


The Count Vectorizer that was fitted on the training data is used to transform the Foursquare words into a sparse matrix, because the best Logistic Regression was trained on Count Vectorizer features.
The Logistic Regression with the best parameters found on the training data then predicts the sentiment (y_hat) for the transformed Foursquare data.



Logistic regression gave a good score when classifying the text, but different classifiers could be tried to determine the best accuracy score.


In [None]:
foursquare = pd.read_csv('./foursquare_clean.csv', encoding='utf8')
# foursquare = foursquare.dropna()
X = foursquare.lem_words

In [None]:
foursquare.shape

In [None]:
# transform the Foursquare data with the vectorizer the model was trained on
# (tvec was fitted with different settings, so its matrix would not match best_lr)
X_mat = cvec_p.transform(X)

# predicted probabilities

# y_hat_pp = best_lr.predict_proba(X_mat)
# y_hat_pp  # can these also be computed for the Foursquare data?

# Y_pp = pd.DataFrame(best_lr.predict_proba(X_test_mat), columns=['class_0_pp','class_1_pp', 'class_2_pp'])
# Y_pp.head(10) 

# probability for each of the three classes

In [None]:
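foursquare['polarity_pred'] = best_lr.predict(X_mat)
# split the 'll' column into latitude and longitude
# (assumes 'll' is stored as 'lat,lng', the order used by the Foursquare API)
lat=[]
lng=[]
for ll in foursquare['ll']:
    latlng = ll.split(',')
    lat.append(latlng[0])
    lng.append(latlng[1])
foursquare['lng'] =lng
foursquare['lat'] =lat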

# dummies = pd.get_dummies(foursquare.polarity_pred)
# dummies.columns = ['negative_0', 'neutral_1', 'positive_2']
# foursquare_all = pd.concat([foursquare, dummies])

foursquare.head()

## Save Foursquare Predictions

In [None]:
foursquare.to_csv('foursquare_predictions.csv', header=True, index=False, encoding='UTF8')

df_neutral= foursquare[foursquare.polarity_pred ==1]
df_positive= foursquare[foursquare.polarity_pred ==2]
df_negative= foursquare[foursquare.polarity_pred ==0]

# created to make graphs in Tableau
df_neutral.to_csv('fouresquare_predictions/df_neutral.csv', header=True, index=False, encoding='UTF8')
df_positive.to_csv('fouresquare_predictions/df_positive.csv', header=True, index=False, encoding='UTF8')
df_negative.to_csv('fouresquare_predictions/df_negative.csv', header=True, index=False, encoding='UTF8')

# Visualisations 

In [None]:
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)

In [None]:
foursquare.groupby(['polarity_pred']).count()

In [None]:
# one bin per class, centred on 0, 1 and 2 so the bars line up with the tick labels
plt.hist(foursquare.polarity_pred, bins=np.arange(4) - 0.5, align='mid')
plt.xticks(range(3), ['Negative','Neutral', 'Positive'])
plt.xlabel('Predicted Sentiment of Reviews')
plt.title('Distribution of Sentiment of Reviews')
plt.show()

# NOTES

https://stackoverflow.com/questions/40679883/scikit-learn-how-to-include-others-features-after-performed-fit-and-transform-o

Later, additional work:
- clustering
- what is similar to logistic regression?
- neural networks, e.g. `from sklearn.neural_network import MLPClassifier`

We choose parameters to improve our analysis. Some parameters could be applied but will not be used, because the preprocessing step already covered them: stop words, lowercasing and tokenizing. We will include parameters over a range of values, which will be grid searched to give the optimal score for the specific model in the pipeline. 

Parameters let us specify exactly how we want to break down the documents.

- max_features : 
Only the top `max_features` terms, ordered by term frequency across the corpus, are kept as the vocabulary. 
- norm : ('l1', 'l2', None)
Norm used to normalize term vectors. None for no normalization.
- ngram_range : (min_n, max_n)
The lower and upper boundaries of the range of n-values for the n-grams to be extracted. Open question: should this only be used on raw text?
- min_df : 
With min_df=2, a word must appear in at least two documents to be considered. In practice this is useful for removing things like URLs from text, which appear as one-offs.
- max_df : 
A float value such as 0.5 tells the vectorizer to ignore words which appear in more than 50% of the documents in the corpus. This generally catches words not already defined in the stop words set.
- stop_words : 
We can also define a set of stop words. These are words like "a" or "is" which appear so often in a language that we know they won't provide useful information and so can be ignored.
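
The effect of these parameters is easiest to see on a tiny corpus. The sketch below uses four made-up sentences (`toy_docs` and the helper `vocabulary` are hypothetical) and compares the vocabulary learned under each setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical corpus: 'the' and 'was' are in all 4 documents, 'great' in 2, the rest in 1
toy_docs = ['the food was great', 'the service was slow',
            'the coffee was great', 'the staff was rude']


def vocabulary(**vectorizer_params):
    """Return the sorted vocabulary learned with the given CountVectorizer settings."""
    vec = CountVectorizer(**vectorizer_params)
    vec.fit(toy_docs)
    return sorted(vec.vocabulary_)


vocab_default = vocabulary()                 # all 9 distinct words
vocab_min_df = vocabulary(min_df=2)          # drop words seen in fewer than 2 documents
vocab_max_df = vocabulary(max_df=0.5)        # drop words seen in more than 50% of documents
vocab_top3 = vocabulary(max_features=3)      # keep the 3 most frequent words
```

Note that `max_df=0.5` keeps "great", which appears in exactly half of the documents: only words above the threshold are dropped.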



http://apapiu.github.io/2016-08-04-tf_idf/

- Finally, fit_transform() is used to train the vectorizer on the corpus and transform it in a single step.


sublinear_tf=True, max_features = 500 

max_features : int or None, default=None
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

sublinear_tf : boolean, default=False
    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

TF-IDF: a combination of term frequency (optionally with sublinear scaling) and inverse document frequency

Term Frequency (TF)
local frequency of a word in the document
i.e. the word is weighed by how many times it occurs in the document
$\mathrm{tf}(w,d) = |\{w' \in d : w' = w\}|$ where $w$ is a word and $d = \{w_1, \ldots, w_m\}$ is a document

Sublinear TF:
sometimes a word is used too often so we want to reduce its influence compared to other less frequently used words
for that we can use some sublinear function, e.g.
$\log \mathrm{tf}(w,d)$ or $\sqrt{\mathrm{tf}(w,d)}$
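
As a small check of how scikit-learn applies sublinear scaling, the sketch below switches off IDF weighting and normalization so that only the term frequency remains. The one-document corpus is made up for illustration. scikit-learn's variant is 1 + log(tf), which differs slightly from the plain log(tf) in the notes above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical document: 'good' occurs 3 times, 'bad' once
toy_docs = ['good good good bad']

# use_idf=False and norm=None leave only the term frequency in the output
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(toy_docs).toarray()
sub_tf = TfidfVectorizer(use_idf=False, norm=None,
                         sublinear_tf=True).fit_transform(toy_docs).toarray()

# columns are in alphabetical order: 'bad', 'good'
expected_sub_tf = 1 + np.log(raw_tf)
```

The count of 3 for "good" shrinks to about 2.1, while the count of 1 for "bad" is unchanged, so repeated words carry less extra weight.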

http://0agr.ru/wiki/index.php/TF-IDF

## k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

# iterate through folds 5-10
for folds in range(5,11):
    print '------------------------------------\n'
    print 'K:', folds
    
    # Perform cross-validation (the default score for a classifier is accuracy)
    scores = cross_val_score(lr, X_train_cvec_p, y_train, cv=folds)
    print "Cross-validated scores:", scores
    print "Mean CV accuracy:", np.mean(scores)
    print 'Std CV accuracy:', np.std(scores)
    
    # Make cross-validated predictions
    predictions = cross_val_predict(lr, X_train_cvec_p, y_train, cv=folds)
    
    acc = metrics.accuracy_score(y_train, predictions)
    print "Cross-predicted accuracy:", acc