# Homework for the lecture "Statistics. Practice"

## Task 1

Let's return to the [video game dataset](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/vgsales.csv).

Answer the following questions:

1) How do critics feel about sports games?  
2) Do critics prefer games on PC or on PS4?  
3) Do critics prefer shooters or strategy games?  

For each question:
- state the null and alternative hypotheses;
- choose a significance threshold;
- describe the results of the statistical test.

In [58]:
import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split



In [17]:
vg_df = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/stat_case_study/vgsales.csv')
vg_df.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [27]:
scores = vg_df['Critic_Score'].dropna()
scores.mean()

68.96767850559173

In [4]:
# 1) How do critics feel about sports games?
# Claim: critics mostly give games above-average scores (more than 69 points).
# H0: the mean critic score is <= 69
# HA: the mean critic score is > 69
# alpha = 0.05 is the significance threshold
# Note: `scores` contains every game with a critic score, not only Genre == 'Sports'.


In [29]:
alpha = 0.05
result = stats.ttest_1samp(scores, 69.0, alternative='greater')

print(result)

if result.pvalue < alpha:
    print('Reject the null hypothesis: the mean critic score is above 69')
else:
    print('Fail to reject the null hypothesis: no evidence that the mean critic score is above 69')

Ttest_1sampResult(statistic=-0.20917896173357323, pvalue=0.5828431139099027)
Fail to reject the null hypothesis: no evidence that the mean critic score is above 69
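
The cell above tests `scores`, which holds every game with a critic score. A minimal sketch of how the same one-sided test could be restricted to sports games follows. The `toy_df` frame and its values are invented for illustration and only mimic the `Genre` and `Critic_Score` columns; `sports_score_test` is a hypothetical helper, and the reference value 69 is the rounded overall mean computed earlier.

```python
import pandas as pd
from scipy import stats

# Made-up rows that mimic two columns of vgsales.csv (not real data).
toy_df = pd.DataFrame({
    'Genre': ['Sports', 'Sports', 'Sports', 'Sports', 'Racing', 'Shooter', 'Sports'],
    'Critic_Score': [76.0, 80.0, None, 71.0, 82.0, 88.0, 74.0],
})


def sports_score_test(df, reference_mean=69.0, alpha=0.05):
    """One-sided one-sample t-test on the critic scores of sports games only."""
    # Keep sports games and drop rows without a critic score.
    sports_scores = df.loc[df['Genre'] == 'Sports', 'Critic_Score'].dropna()
    # H0: mean <= reference_mean, HA: mean > reference_mean.
    result = stats.ttest_1samp(sports_scores, reference_mean, alternative='greater')
    return sports_scores, result, bool(result.pvalue < alpha)


sports_scores, result, reject_h0 = sports_score_test(toy_df)
print(len(sports_scores), sports_scores.mean(), result.pvalue, reject_h0)
```

On the real data the call would be `sports_score_test(vg_df)`; the result is not shown here because it has not been computed in this notebook.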


In [30]:
# 2) Do critics prefer games on PC or on PS4?
# Is the mean critic score the same for PC and PS4 games?
# H0: the mean critic scores of PC and PS4 games are equal
# H1: the mean critic scores of PC and PS4 games differ
# alpha = 0.05 is the significance threshold

In [31]:
PC_scores = vg_df[np.logical_and(np.logical_not(vg_df['Critic_Score'].isna()),
                                    vg_df['Platform'] == 'PC')]['Critic_Score']

PS4_scores = vg_df[np.logical_and(np.logical_not(vg_df['Critic_Score'].isna()),
                                    vg_df['Platform'] == 'PS4')]['Critic_Score']

In [32]:
PC_scores.head()

85     86.0
138    93.0
192    88.0
218    93.0
284    96.0
Name: Critic_Score, dtype: float64

In [33]:
PS4_scores.head()

42     97.0
77     82.0
92     83.0
94     85.0
105    87.0
Name: Critic_Score, dtype: float64

In [35]:
result = stats.ttest_ind(PC_scores, PS4_scores, equal_var=False)
print(result)

if result.pvalue < alpha:
    print('Reject the null hypothesis: the mean critic scores of PC and PS4 games differ')
else:
    print('Fail to reject the null hypothesis: no evidence that the mean critic scores of PC and PS4 games differ')

Ttest_indResult(statistic=4.3087588262138725, pvalue=2.067249157283479e-05)
Reject the null hypothesis: the mean critic scores of PC and PS4 games differ
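
The two-sided test shows that the means differ but does not say which platform scores higher. Because `PC_scores` is passed first, the positive statistic (about 4.31) corresponds to a higher sample mean for PC. The sketch below uses a hypothetical helper, `compare_means`, that reports both group means next to Welch's test so the direction is explicit. The two score lists are invented and are not taken from the dataset.

```python
import numpy as np
from scipy import stats


def compare_means(sample_a, sample_b, alpha=0.05):
    """Welch's t-test plus both group means, so the direction of a difference is visible."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # equal_var=False selects Welch's test, which does not assume equal variances.
    result = stats.ttest_ind(a, b, equal_var=False)
    return {
        'mean_a': a.mean(),
        'mean_b': b.mean(),
        'statistic': result.statistic,
        'pvalue': result.pvalue,
        'reject_h0': bool(result.pvalue < alpha),
    }


# Made-up scores, used only to show the call.
group_a = [80, 85, 78, 90, 83, 88]
group_b = [70, 72, 68, 75, 71, 74]
summary = compare_means(group_a, group_b)
print(summary)
```

The same helper could be called as `compare_means(PC_scores, PS4_scores)` or with the shooter and strategy samples used later in this task.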


In [6]:
# 3) Do critics prefer shooters or strategy games?
# Is the mean critic score the same for shooters and strategy games?
# H0: the mean critic scores of shooters and strategy games are equal
# HA: the mean critic scores of shooters and strategy games differ
# alpha = 0.05 is the significance threshold

In [36]:
shoter_scores = vg_df[np.logical_and(np.logical_not(vg_df['Critic_Score'].isna()),
                                    vg_df['Genre'] == 'Shooter')]['Critic_Score']

strategy_scores = vg_df[np.logical_and(np.logical_not(vg_df['Critic_Score'].isna()),
                                    vg_df['Genre'] == 'Strategy')]['Critic_Score']

In [37]:
shoter_scores.head()

29    88.0
32    87.0
34    83.0
35    83.0
36    94.0
Name: Critic_Score, dtype: float64

In [38]:
strategy_scores.head()

218     93.0
582     82.0
1078    90.0
1095    86.0
1128    89.0
Name: Critic_Score, dtype: float64

In [39]:
result = stats.ttest_ind(shoter_scores, strategy_scores, equal_var=False)
print(result)

if result.pvalue < alpha:
    print('Reject the null hypothesis: the mean critic scores of shooters and strategy games differ')
else:
    print('Fail to reject the null hypothesis: no evidence that the mean critic scores of shooters and strategy games differ')

Ttest_indResult(statistic=-2.2972408230640315, pvalue=0.021938989522304823)
Reject the null hypothesis: the mean critic scores of shooters and strategy games differ


## Task 2

Implement a baseline logistic regression model that classifies text messages as spam or not spam (the data are [here](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/spam.csv)). To do this:

1) Convert all text to lowercase;  
2) Remove junk characters;  
3) Remove stop words;  
4) Reduce all words to their normal form (lemmatize);  
5) Convert all messages to TF-IDF vectors. The following code will help:  

```
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df.Message)
names = tfidf.get_feature_names_out()
tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(), columns=names)
```

You may experiment with the parameters of [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html);  
6) Split the data into test and training sets in a 30/70 ratio with `random_state=42`. Use [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html);  
7) Build a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model with `random_state=42` and evaluate its accuracy on the test data;  
8) Describe the results using [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html?highlight=confusion_matrix#sklearn.metrics.confusion_matrix);  
9) Build a dataframe that contains all original message texts that were misclassified (showing the actual and predicted labels).

In [41]:
df = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/stat_case_study/spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [42]:
# 1. Convert all text to lowercase
df['Message'] = df['Message'].str.lower()
df.head()

Unnamed: 0,Category,Message
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


In [43]:
# 2. Remove junk characters
df['words'] = df['Message'].map(lambda x: re.sub(r'[\W_]+', ' ', x))
df.head()

Unnamed: 0,Category,Message,words
0,ham,"go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...
1,ham,ok lar... joking wif u oni...,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say
4,ham,"nah i don't think he goes to usf, he lives aro...",nah i don t think he goes to usf he lives arou...


In [44]:
# 3. Remove stop words (first split each message into tokens)
df['words'] = df['words'].map(lambda x: x.split())
df.head()

Unnamed: 0,Category,Message,words
0,ham,"go until jurong point, crazy.. available only ...","[go, until, jurong, point, crazy, available, o..."
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,u dun say so early hor... u c already then say...,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"nah i don't think he goes to usf, he lives aro...","[nah, i, don, t, think, he, goes, to, usf, he,..."


In [46]:
stopwords_set = set(stopwords.words('english'))
df['no_stopwords'] = df['words'].map(lambda x: [word for word in x if word not in stopwords_set] )
df.head()

Unnamed: 0,Category,Message,words,no_stopwords
0,ham,"go until jurong point, crazy.. available only ...","[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,u dun say so early hor... u c already then say...,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"nah i don't think he goes to usf, he lives aro...","[nah, i, don, t, think, he, goes, to, usf, he,...","[nah, think, goes, usf, lives, around, though]"


In [47]:
# 4. Reduce all words to their normal form
lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['no_stopwords'].map(lambda x: [lemmatizer.lemmatize(word) for word in x] )
df.head()

Unnamed: 0,Category,Message,words,no_stopwords,lemmatized
0,ham,"go until jurong point, crazy.. available only ...","[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,u dun say so early hor... u c already then say...,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"nah i don't think he goes to usf, he lives aro...","[nah, i, don, t, think, he, goes, to, usf, he,...","[nah, think, goes, usf, lives, around, though]","[nah, think, go, usf, life, around, though]"


In [51]:
df['result_message'] = df['lemmatized'].str.join(sep=' ')
df.head()

Unnamed: 0,Category,Message,words,no_stopwords,lemmatized,result_message
0,ham,"go until jurong point, crazy.. available only ...","[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazy, available, bugis, n...",go jurong point crazy available bugis n great ...
1,ham,ok lar... joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin...",free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say so early hor... u c already then say...,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,ham,"nah i don't think he goes to usf, he lives aro...","[nah, i, don, t, think, he, goes, to, usf, he,...","[nah, think, goes, usf, lives, around, though]","[nah, think, go, usf, life, around, though]",nah think go usf life around though
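
Steps 1 to 4 above can also be collected into one function and applied in a single pass. The sketch below is an illustration under stated assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, the small stop-word set is a stand-in for NLTK's English list, and the default `lemmatize` argument leaves words unchanged so that the example runs without downloading any NLTK data.

```python
import re

# Small stand-in for NLTK's English stop-word list (an assumption for this sketch).
STOP_WORDS = {'i', 'a', 'in', 'to', 'so', 'he', 'until', 'only', 'then', 'don', 't'}


def preprocess(message, stop_words=STOP_WORDS, lemmatize=lambda word: word):
    """Lowercase, replace non-word characters with spaces, drop stop words, lemmatize, rejoin."""
    cleaned = re.sub(r'[\W_]+', ' ', message.lower())
    tokens = [word for word in cleaned.split() if word not in stop_words]
    return ' '.join(lemmatize(word) for word in tokens)


print(preprocess("Nah I don't think he goes to usf, he lives around here though"))
print(preprocess('Ok lar... Joking wif u oni...'))
```

With the objects from the notebook, the call could look like `df['Message'].map(lambda m: preprocess(m, stopwords_set, lemmatizer.lemmatize))`, which yields the same kind of string as `result_message`.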


In [52]:
# 5. Convert all messages to TF-IDF vectors (first drop the intermediate columns)
df = df.drop(columns=['Message', 'words', 'no_stopwords', 'lemmatized'])
df.head()

Unnamed: 0,Category,result_message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah think go usf life around though


In [53]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['result_message'])

names = tfidf.get_feature_names_out()
matrix = pd.DataFrame(tfidf_matrix.toarray(), columns=names)
matrix.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
# 6. Split the data into test and training sets in a 30/70 ratio with random_state=42
df['is_spam'] = (df['Category'] == 'spam').astype(int)
df.head()

Unnamed: 0,Category,result_message,is_spam
0,ham,go jurong point crazy available bugis n great ...,0
1,ham,ok lar joking wif u oni,0
2,spam,free entry 2 wkly comp win fa cup final tkts 2...,1
3,ham,u dun say early hor u c already say,0
4,ham,nah think go usf life around though,0


In [59]:
X_train, X_test, y_train, y_test = train_test_split(matrix, df['is_spam'], test_size=0.30, random_state=42)

In [60]:
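# 7. Build a logistic regression model with random_state=42 and evaluate its accuracy on the test data
# Note: this cell fits LinearDiscriminantAnalysis, not LogisticRegression; the results below come from LDA.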
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

LinearDiscriminantAnalysis()

In [62]:
# Accuracy
accuracy_score(y_test, lda.predict(X_test))

0.9677033492822966
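
The task asks for logistic regression, while the cells above fit `LinearDiscriminantAnalysis`. A minimal sketch of the logistic regression variant is given below. It assumes a scikit-learn `Pipeline`, so TF-IDF is fitted on the training messages only. The ten messages and their labels are made up; no accuracy for the real data is claimed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already preprocessed messages; 1 = spam, 0 = ham.
messages = [
    'free entry win prize call now',
    'urgent claim free cash prize',
    'win free ticket text now',
    'call now free camera awarded',
    'free membership prize week urgent',
    'ok lar joking wif u',
    'see you lunch tomorrow',
    'nah think go usf around',
    'back good journey let know',
    'fair enough anything going',
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.30, random_state=42
)

# The pipeline fits TF-IDF on the training messages only and keeps the matrix sparse.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

In the notebook the equivalent change would be to replace `LinearDiscriminantAnalysis()` with `LogisticRegression(random_state=42)`; the accuracy and confusion matrix would then need to be recomputed.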

In [65]:
# 8. Describe the results using confusion_matrix
confusion_arr = confusion_matrix(y_test, lda.predict(X_test))
confusion_arr

array([[1445,    3],
       [  51,  173]], dtype=int64)

In [67]:
print(f'Correct predictions: {confusion_arr[0,0] + confusion_arr[1,1]}')
print(f'Errors: {confusion_arr[0,1] + confusion_arr[1,0]}')

Correct predictions: 1618
Errors: 54
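
The totals above hide how the errors are distributed between the two classes. The sketch below takes the confusion matrix reported above and derives accuracy, precision, and recall for the spam class. It relies on the scikit-learn convention that rows are actual classes and columns are predicted classes, with labels in sorted order (0 = ham, 1 = spam).

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
confusion_arr = np.array([[1445, 3],
                          [51, 173]])

# For a 2x2 matrix, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_arr.ravel()

accuracy = (tp + tn) / confusion_arr.sum()
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was detected

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
# prints: accuracy=0.9677 precision=0.9830 recall=0.7723
```

Read this way, the model mislabels only 3 of 1448 ham messages as spam, but it misses 51 of 224 spam messages, so most of the 54 errors are undetected spam.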


In [69]:
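# 9. Build a dataframe with all original message texts that were misclassified (showing the actual and predicted labels)
# Caution: the rows below are in ascending index order while lda.predict(X_test) follows the shuffled X_test order,
# so the 'predict' column is not aligned with the rows; the original 'Message' column was also dropped in step 5.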
errors_df = pd.concat(
    [df[df.index.isin(y_test.index)].reset_index(), 
    pd.Series(lda.predict(X_test), 
    np.arange(len(lda.predict(X_test))),
                              name = 'predict')], axis=1)
                              
errors_df = errors_df[errors_df['is_spam'] != errors_df['predict']]
errors_df

Unnamed: 0,index,Category,result_message,is_spam,predict
0,8,spam,winner valued network customer selected receiv...,1,0
1,12,spam,urgent 1 week free membership 100 000 prize ja...,1,0
2,15,spam,xxxmobilemovieclub use credit click wap link n...,1,0
4,19,spam,england v macedonia dont miss goal team news t...,1,0
14,47,ham,fair enough anything going,0,1
...,...,...,...,...,...
1633,5446,ham,back good journey let know need receipt shall ...,0,1
1634,5450,ham,sac need carry,0,1
1638,5457,ham,arun u transfr amt,0,1
1657,5524,spam,awarded sipix digital camera call 09061221061 ...,1,0
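
In the cell above, the predictions are attached by position to rows sorted by index, although `train_test_split` shuffles the test rows. One way to keep the labels aligned is to give the predictions the index of `y_test` before concatenating. The sketch below shows that idea on made-up stand-ins; `messages`, `y_test`, and `predicted` here are illustrative and are not the notebook's objects.

```python
import pandas as pd

# Made-up original texts, indexed like the source dataframe.
messages = pd.Series(
    {0: 'ok lar', 1: 'free prize call now', 2: 'see you soon',
     3: 'win cash today', 4: 'lunch at noon'},
    name='Message',
)

# True labels of a shuffled test split (note the non-sorted index).
y_test = pd.Series([1, 0, 1], index=[3, 0, 1], name='actual')

# Predictions come back as a plain sequence in the same order as the test rows.
raw_predictions = [1, 1, 0]

# Re-attach the test index so rows align by label rather than by position.
predicted = pd.Series(raw_predictions, index=y_test.index, name='predicted')

results = pd.concat([messages.loc[y_test.index], y_test, predicted], axis=1)
errors = results[results['actual'] != results['predicted']]
print(errors)
```

In the notebook this would correspond to `pd.Series(lda.predict(X_test), index=y_test.index)` and to keeping a copy of the original `Message` column before it is dropped in step 5.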


#### NOTE
Submit the homework as a link to a [GitHub](https://github.com/) repository.
We cannot review it or help you if you send:
- files;
- archives;
- screenshots of code.

All discussions and consultations about the homework take place only in the corresponding Slack channel.

##### How to ask graduate students, instructors, and colleagues questions properly?
Before asking a question, try to find the answer yourself online. Searching for information independently is one of the most important skills, and practicing specialists at every level do it every day.

Every question should follow this template:  
1) What am I doing?  
2) What result do I expect?  
3) How does the actual result differ from the expected one?  
4) What have I already tried in order to fix the problem?  

Where possible, attach screenshots or links to the code. Include only the problematic, reproducible part of the code; posting the whole solution is not allowed.
