# DonorsChoose

<p>
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
</p>
<p>
    Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
<ul>
<li>
    How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible</li>
    <li>How to increase the consistency of project vetting across different volunteers to improve the experience for teachers</li>
    <li>How to focus volunteer time on the applications that need the most assistance</li>
    </ul>
</p>    
<p>
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
</p>

## About the DonorsChoose Data Set

The `train.csv` data set provided by DonorsChoose contains the following features:

Feature | Description 
----------|---------------
**`project_id`** | A unique identifier for the proposed project. **Example:** `p036502`   
**`project_title`**    | Title of the project. **Examples:**<br><ul><li><code>Art Will Make You Happy!</code></li><li><code>First Grade Fun</code></li></ul> 
**`project_grade_category`** | Grade level of students for which the project is targeted. One of the following enumerated values: <br/><ul><li><code>Grades PreK-2</code></li><li><code>Grades 3-5</code></li><li><code>Grades 6-8</code></li><li><code>Grades 9-12</code></li></ul>  
 **`project_subject_categories`** | One or more (comma-separated) subject categories for the project from the following enumerated list of values:  <br/><ul><li><code>Applied Learning</code></li><li><code>Care &amp; Hunger</code></li><li><code>Health &amp; Sports</code></li><li><code>History &amp; Civics</code></li><li><code>Literacy &amp; Language</code></li><li><code>Math &amp; Science</code></li><li><code>Music &amp; The Arts</code></li><li><code>Special Needs</code></li><li><code>Warmth</code></li></ul><br/> **Examples:** <br/><ul><li><code>Music &amp; The Arts</code></li><li><code>Literacy &amp; Language, Math &amp; Science</code></li></ul>  
  **`school_state`** | State where school is located ([Two-letter U.S. postal code](https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations#Postal_codes)). **Example:** `WY`
**`project_subject_subcategories`** | One or more (comma-separated) subject subcategories for the project. **Examples:** <br/><ul><li><code>Literacy</code></li><li><code>Literature &amp; Writing, Social Sciences</code></li></ul> 
**`project_resource_summary`** | An explanation of the resources needed for the project. **Example:** <br/><ul><li><code>My students need hands on literacy materials to manage sensory needs!</code></li></ul> 
**`project_essay_1`**    | First application essay<sup>*</sup>  
**`project_essay_2`**    | Second application essay<sup>*</sup> 
**`project_essay_3`**    | Third application essay<sup>*</sup> 
**`project_essay_4`**    | Fourth application essay<sup>*</sup> 
**`project_submitted_datetime`** | Datetime when project application was submitted. **Example:** `2016-04-28 12:43:56.245`   
**`teacher_id`** | A unique identifier for the teacher of the proposed project. **Example:** `bdf8baa8fedef6bfeec7ae4ff1c15c56`  
**`teacher_prefix`** | Teacher's title. One of the following enumerated values: <br/><ul><li><code>nan</code></li><li><code>Dr.</code></li><li><code>Mr.</code></li><li><code>Mrs.</code></li><li><code>Ms.</code></li><li><code>Teacher</code></li></ul>  
**`teacher_number_of_previously_posted_projects`** | Number of project applications previously submitted by the same teacher. **Example:** `2` 

<sup>*</sup> See the section <b>Notes on the Essay Data</b> for more details about these features.

Additionally, the `resources.csv` data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature | Description 
----------|---------------
**`id`** | A `project_id` value from the `train.csv` file.  **Example:** `p036502`   
**`description`** | Description of the resource. **Example:** `Tenor Saxophone Reeds, Box of 25`   
**`quantity`** | Quantity of the resource required. **Example:** `3`   
**`price`** | Price of the resource required. **Example:** `9.95`   

**Note:** Many projects require multiple resources. The `id` value corresponds to a `project_id` in `train.csv`, so you can use it as a key to retrieve all resources needed for a project.
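As a minimal sketch of that lookup, the snippet below filters a toy stand-in for `resources.csv` by project id. The rows and the `Music Stand` entry are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Toy stand-in for resources.csv (illustrative values, not real data)
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p000001"],
    "description": ["Tenor Saxophone Reeds, Box of 25", "Music Stand", "Pencils"],
    "quantity": [3, 1, 10],
    "price": [9.95, 25.00, 0.50],
})

# All resource rows whose id matches the project of interest
project_resources = resources[resources["id"] == "p036502"]

# Total cost of the project, weighting each unit price by its quantity
total_cost = (project_resources["quantity"] * project_resources["price"]).sum()
print(project_resources)
print("Total cost:", round(total_cost, 2))
```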

The data set contains the following label (the value you will attempt to predict):

Label | Description
----------|---------------
`project_is_approved` | A binary flag indicating whether DonorsChoose approved the project. A value of `0` indicates the project was not approved, and a value of `1` indicates the project was approved.

### Notes on the Essay Data

<ul>
Prior to May 17, 2016, the prompts for the essays were as follows:
<li>__project_essay_1:__ "Introduce us to your classroom"</li>
<li>__project_essay_2:__ "Tell us more about your students"</li>
<li>__project_essay_3:__ "Describe how your students will use the materials you're requesting"</li>
<li>__project_essay_4:__ "Close by sharing why your project will make a difference"</li>
</ul>


<ul>
Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:<br>
<li>__project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."</li>
<li>__project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"</li>
<br>For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
</ul>


In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter

matplotlib.rc("lines", markeredgewidth=0.5)


In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## 1.1 Reading Data

In [0]:
project_data = pd.read_csv('/content/drive/My Drive/Assignment2/train_data.csv')
resource_data = pd.read_csv('/content/drive/My Drive/Assignment2/resources.csv')
#test_data = pd.read_csv('/content/drive/My Drive/Assignment2/test_data.csv')

In [0]:
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)

Number of data points in train data (109248, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category'
 'project_subject_categories' 'project_subject_subcategories'
 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3'
 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']


In [4]:
# how to replace elements in list python: https://stackoverflow.com/a/2582163/4084039
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(project_data.columns)]


#sort dataframe based on time pandas python: https://stackoverflow.com/a/49702492/4084039
project_data['Date'] = pd.to_datetime(project_data['project_submitted_datetime'])
project_data.drop('project_submitted_datetime', axis=1, inplace=True)
project_data.sort_values(by=['Date'], inplace=True)


# how to reorder columns pandas python: https://stackoverflow.com/a/13148611/4084039
project_data = project_data[cols]

project_data.head(2)

Unnamed: 0.1,Unnamed: 0,id,teacher_id,teacher_prefix,school_state,Date,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
55660,8393,p205479,2bf07ba08945e5d8b2a3f269b2b3cfe5,Mrs.,CA,2016-04-27 00:27:36,Grades PreK-2,Math & Science,"Applied Sciences, Health & Life Science",Engineering STEAM into the Primary Classroom,I have been fortunate enough to use the Fairy ...,My students come from a variety of backgrounds...,Each month I try to do several science or STEM...,It is challenging to develop high quality scie...,My students need STEM kits to learn critical s...,53,1
76127,37728,p043609,3f60494c61921b3b43ab61bdde2904df,Ms.,UT,2016-04-27 00:31:25,Grades 3-5,Special Needs,Special Needs,Sensory Tools for Focus,Imagine being 8-9 years old. You're in your th...,"Most of my students have autism, anxiety, anot...",It is tough to do more than one thing at a tim...,When my students are able to calm themselves d...,My students need Boogie Boards for quiet senso...,4,1


In [0]:
## Test Data: Convert the 'project_submitted_datetime' into datetime type
'''
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(test_data.columns)]

test_data['Date'] = pd.to_datetime(test_data['project_submitted_datetime'])
test_data.drop('project_submitted_datetime', axis=1, inplace=True)

test_data = test_data[cols]

test_data.head(2)'''

In [0]:
print("Number of data points in the resources data", resource_data.shape)
print('-'*50)
print(resource_data.columns.values)
resource_data.head(2)

Number of data points in the resources data (1541272, 4)
--------------------------------------------------
['id' 'description' 'quantity' 'price']


Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95


In [0]:
'''
print("Number of data points in test data", test_data.shape)
print('-'*50)
print(test_data.columns.values)
test_data.head(2)'''

## 1.2 preprocessing of `project_subject_categories`

In [0]:
catogories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # splits it into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split each category on whitespace: "Math & Science" => "Math", "&", "Science"
            j=j.replace('The','') # if the word "The" is present, remove it
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp+=j.strip()+" " # " abc ".strip() returns "abc": it removes leading and trailing whitespace
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
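The cleaning loop above can also be read as a per-string function. The sketch below is a restatement of the same steps, not a replacement for the cell; `clean_category` is a hypothetical helper name introduced here.

```python
def clean_category(raw):
    """Turn 'Literacy & Language, Math & Science' into 'Literacy_Language Math_Science'."""
    tokens = []
    for part in raw.split(','):
        # Drop the standalone word "The" (as in "Music & The Arts")
        if 'The' in part.split():
            part = part.replace('The', '')
        # Remove spaces, then replace '&' with '_' so each category is one token
        part = part.replace(' ', '').replace('&', '_')
        tokens.append(part)
    return ' '.join(tokens)

print(clean_category("Music & The Arts"))
print(clean_category("Literacy & Language, Math & Science"))
```

Because every category becomes a single whitespace-free token, the cleaned strings can later be split on spaces or passed to a token-based vectorizer.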


In [0]:
'''
## preprocessing project_subject_categories for test data
catogories = list(test_data['project_subject_categories'].values)

cat_list = []
for i in catogories:
    temp = ""
    for j in i.split(','): 
        if 'The' in j.split(): 
            j=j.replace('The','') 
        j = j.replace(' ','') 
        temp+=j.strip()+" " 
        temp = temp.replace('&','_')
    cat_list.append(temp.strip())
    
test_data['clean_categories'] = cat_list
test_data.drop(['project_subject_categories'], axis=1, inplace=True)'''

## 1.3 preprocessing of `project_subject_subcategories`

In [0]:
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # splits it into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split each subcategory on whitespace: "Math & Science" => "Math", "&", "Science"
            j=j.replace('The','') # if the word "The" is present, remove it
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp +=j.strip()+" " # " abc ".strip() returns "abc": it removes leading and trailing whitespace
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))

In [0]:
'''
## preprocessing project_subject_subcategories for test data
sub_catogories = list(test_data['project_subject_subcategories'].values)

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    for j in i.split(','): 
        if 'The' in j.split(): 
            j=j.replace('The','') 
        j = j.replace(' ','') 
        temp +=j.strip()+" "
        temp = temp.replace('&','_')
    sub_cat_list.append(temp.strip())

test_data['clean_subcategories'] = sub_cat_list
test_data.drop(['project_subject_subcategories'], axis=1, inplace=True)'''


## 1.4 preprocessing of price attribute of resources data 

In [0]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data = pd.merge(project_data, price_data, on='id', how='left')
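To see what the aggregation and left merge produce, here is a small sketch on toy frames. The ids and values are made up. Note that summing `price` adds up unit prices without weighting by `quantity`, and that a project with no resource rows ends up with `NaN` after a left merge.

```python
import pandas as pd

# Toy stand-ins for resource_data and project_data (illustrative values only)
toy_resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "quantity": [3, 1, 2],
    "price": [9.95, 25.0, 4.0],
})
toy_projects = pd.DataFrame({"id": ["p1", "p2", "p3"]})

# One row per project: sum of unit prices and sum of quantities
toy_price_data = toy_resources.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()

# Left merge keeps every project, even one without any resource rows (p3)
merged = pd.merge(toy_projects, toy_price_data, on="id", how="left")
print(merged)
```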

## 1.5 Text preprocessing requisites

In [0]:
# merge two column text dataframe: 
project_data["essay"] = project_data["project_essay_1"].map(str) +\
                        project_data["project_essay_2"].map(str) + \
                        project_data["project_essay_3"].map(str) + \
                        project_data["project_essay_4"].map(str)
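One thing to keep in mind with `.map(str)`: a missing essay (`NaN`) is converted to the literal text `nan`, and the Notes on the Essay Data say `project_essay_3` and `project_essay_4` are `NaN` for later projects. The sketch below contrasts that behaviour with a `fillna("")` variant on a toy frame. The space separator in the second variant is a choice made here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row mimics a project with essays 3 and 4 missing
df = pd.DataFrame({
    "project_essay_1": ["A.", "C."],
    "project_essay_2": ["B.", "D."],
    "project_essay_3": ["E.", np.nan],
    "project_essay_4": ["F.", np.nan],
})
essay_cols = ["project_essay_1", "project_essay_2", "project_essay_3", "project_essay_4"]

# Same idea as the cell above: str() turns NaN into the text "nan"
with_map_str = df[essay_cols].apply(lambda row: "".join(map(str, row)), axis=1)

# Alternative: replace missing essays with empty strings before joining
with_fillna = df[essay_cols].fillna("").apply(" ".join, axis=1).str.strip()

print(with_map_str.tolist())
print(with_fillna.tolist())
```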

In [0]:
project_data.head(2)

Unnamed: 0.1,Unnamed: 0,id,teacher_id,teacher_prefix,school_state,Date,project_grade_category,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved,clean_categories,clean_subcategories,price,quantity,essay
0,8393,p205479,2bf07ba08945e5d8b2a3f269b2b3cfe5,Mrs.,CA,2016-04-27 00:27:36,Grades PreK-2,Engineering STEAM into the Primary Classroom,I have been fortunate enough to use the Fairy ...,My students come from a variety of backgrounds...,Each month I try to do several science or STEM...,It is challenging to develop high quality scie...,My students need STEM kits to learn critical s...,53,1,Math_Science,AppliedSciences Health_LifeScience,725.05,4,I have been fortunate enough to use the Fairy ...
1,37728,p043609,3f60494c61921b3b43ab61bdde2904df,Ms.,UT,2016-04-27 00:31:25,Grades 3-5,Sensory Tools for Focus,Imagine being 8-9 years old. You're in your th...,"Most of my students have autism, anxiety, anot...",It is tough to do more than one thing at a tim...,When my students are able to calm themselves d...,My students need Boogie Boards for quiet senso...,4,1,SpecialNeeds,SpecialNeeds,213.03,8,Imagine being 8-9 years old. You're in your th...


In [0]:
'''
test_data["essay"] = test_data["project_essay_1"].map(str) +\
                        test_data["project_essay_2"].map(str) + \
                        test_data["project_essay_3"].map(str) + \
                        test_data["project_essay_4"].map(str)

test_data.head(2)'''

In [0]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
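The order of the substitutions matters: the specific patterns for `won't` and `can't` have to run before the general `n't` rule, otherwise `won't` would become `wo not`. The sketch below uses a trimmed-down copy of the function (named `expand_contractions` here, a hypothetical name) to show this on one sentence.

```python
import re

def expand_contractions(phrase):
    # Specific cases first, so the general "n't" rule does not mangle them
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # General rules
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    return phrase

def general_rule_only(phrase):
    # What happens if the specific cases are skipped
    return re.sub(r"n't", " not", phrase)

sentence = "We won't stop; they're here and don't quit"
print(expand_contractions(sentence))
print(general_rule_only("won't"))
```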

In [0]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
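A stop word list like this is typically applied after stripping punctuation and lowercasing. The cleaning loop itself is not shown in this part of the notebook, so the following is only a sketch under that assumption; `toy_stopwords` is a small illustrative subset and `preprocess` is a hypothetical helper. Keeping `not` out of the list preserves negation in the cleaned text.

```python
import re

# Small illustrative subset of the list above; 'no', 'nor', 'not' are deliberately absent
toy_stopwords = {"i", "my", "the", "to", "a", "are", "is"}

def preprocess(text):
    # Replace anything that is not a letter or digit with a space
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Lowercase, split on whitespace, and drop stop words
    return " ".join(w for w in text.lower().split() if w not in toy_stopwords)

print(preprocess("My students are not able to focus!"))
```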

<h1>2. Naive Bayes</h1>

## 2.1 Splitting data into Train and cross validation (or test): Stratified Sampling

In [11]:

## Split into train and cross-validation datasets
from sklearn.model_selection import train_test_split

#train, cv = train_test_split(project_data, test_size=0.3, random_state = 42)

#train.columns

### split into train, cv and test in 60:20:20 ratio (time based splitting)

train_len = int((len(project_data)*0.6))
cv_len = int((len(project_data) - train_len)/2)
test_len = (len(project_data) - (train_len+cv_len))

#print(train_len, cv_len, test_len)

train = project_data.iloc[:train_len, :]
cv = project_data.iloc[train_len:train_len+cv_len, :]
test = project_data.iloc[train_len+cv_len:, :]

print("Train Data Shape:", train.shape)
print("Cross-validation Data Shape:", cv.shape)
print("Test Data Shape:", test.shape)

train.columns

Train Data Shape: (65548, 20)
Cross-validation Data Shape: (21850, 20)
Test Data Shape: (21850, 20)


Index(['Unnamed: 0', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'Date', 'project_grade_category', 'project_title', 'project_essay_1',
       'project_essay_2', 'project_essay_3', 'project_essay_4',
       'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'project_is_approved',
       'clean_categories', 'clean_subcategories', 'price', 'quantity',
       'essay'],
      dtype='object')
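The 60:20:20 arithmetic used for the time-based split can be checked in isolation. The sketch below wraps it in a hypothetical helper, `split_sizes`, and applies it to the 109,248 rows reported earlier.

```python
def split_sizes(n, train_frac=0.6):
    # The first train_frac of the (date-sorted) rows go to train
    train_len = int(n * train_frac)
    # The remainder is halved between cross-validation and test
    cv_len = int((n - train_len) / 2)
    test_len = n - (train_len + cv_len)
    return train_len, cv_len, test_len

print(split_sizes(109248))  # (65548, 21850, 21850)
```

Because `project_data` was sorted by `Date` beforehand, slicing with these lengths puts the oldest projects in train and the newest in test.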

In [12]:
### Perform Upsampling to overcome Class Imbalance problem:

from sklearn.utils import resample

train_pos = train[train['project_is_approved'] == 1]
train_neg = train[train['project_is_approved'] == 0]

train['project_is_approved'].value_counts()

1    55196
0    10352
Name: project_is_approved, dtype: int64

In [13]:
## upsample minority class: https://elitedatascience.com/imbalanced-classes
## minority class - negative class

train_neg_upsampled = resample(train_neg, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(train_pos),    # to match majority class
                                 random_state=42) 
 
# Combine majority class with upsampled minority class
train_upsampled = pd.concat([train_pos, train_neg_upsampled])

train_upsampled['project_is_approved'].value_counts()


1    55196
0    55196
Name: project_is_approved, dtype: int64
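The same upsampling step on a toy frame makes the mechanics easier to follow. The data below is made up, with eight approved and two rejected rows.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 8 positives, 2 negatives (illustrative only)
toy = pd.DataFrame({"x": range(10), "project_is_approved": [1] * 8 + [0] * 2})
pos = toy[toy["project_is_approved"] == 1]
neg = toy[toy["project_is_approved"] == 0]

# Draw negatives with replacement until they match the number of positives
neg_up = resample(neg, replace=True, n_samples=len(pos), random_state=42)

balanced = pd.concat([pos, neg_up])
counts = balanced["project_is_approved"].value_counts()
print(counts)
```

The upsampled rows are duplicates of existing minority rows, which is why the notebook applies this only to the training split and leaves the cross-validation and test splits untouched.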

## 2.2 Make Data Model Ready: encoding numerical, categorical features

### 2.2.1 Encoding Categorical Features

#### 2.2.1.1 clean_categories

In [14]:
# we use CountVectorizer with a fixed vocabulary and binary=True to one-hot encode the values
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=list(sorted_cat_dict.keys()), lowercase=False, binary=True)
categories_one_hot = vectorizer.fit_transform(train_upsampled['clean_categories'].values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encodig ",categories_one_hot.shape)

# cross-validation data
cv_categories_one_hot = vectorizer.transform(cv['clean_categories'].values)
print("CV : Shape of matrix after one hot encoding ",cv_categories_one_hot.shape)

# test data
test_categories_one_hot = vectorizer.transform(test['clean_categories'].values)
print("test : Shape of matrix after one hot encoding ",test_categories_one_hot.shape)

['Warmth', 'Care_Hunger', 'History_Civics', 'Music_Arts', 'AppliedLearning', 'SpecialNeeds', 'Health_Sports', 'Math_Science', 'Literacy_Language']
Shape of matrix after one hot encodig  (110392, 9)
CV : Shape of matrix after one hot encoding  (21850, 9)
test : Shape of matrix after one hot encoding  (21850, 9)
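A small sketch of how `CountVectorizer` acts as a one-hot (multi-hot) encoder when given a fixed vocabulary. The three-item vocabulary and the two documents are illustrative, not the real category list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary: column order follows this list
vocab = ["Math_Science", "Literacy_Language", "Warmth"]
toy_vectorizer = CountVectorizer(vocabulary=vocab, lowercase=False, binary=True)

docs = ["Math_Science Literacy_Language", "Warmth"]
encoded = toy_vectorizer.fit_transform(docs).toarray()
print(encoded)
# [[1 1 0]
#  [0 0 1]]
```

With `lowercase=False` the tokens must match the vocabulary's casing exactly, and `binary=True` caps each entry at 1 even if a token repeats.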


#### 2.2.1.2 clean_subcategories

In [15]:
# we use CountVectorizer with a fixed vocabulary and binary=True to one-hot encode the values
vectorizer = CountVectorizer(vocabulary=list(sorted_sub_cat_dict.keys()), lowercase=False, binary=True)
sub_categories_one_hot = vectorizer.fit_transform(train_upsampled['clean_subcategories'].values)
print(vectorizer.get_feature_names())
print("Shape of matrix after one hot encodig ",sub_categories_one_hot.shape)

# cross-validation data
cv_sub_categories_one_hot = vectorizer.transform(cv['clean_subcategories'].values)
print("CV : Shape of matrix after one hot encoding ",cv_sub_categories_one_hot.shape)

# test data
test_sub_categories_one_hot = vectorizer.transform(test['clean_subcategories'].values)
print("test : Shape of matrix after one hot encoding ",test_sub_categories_one_hot.shape)

['Economics', 'CommunityService', 'FinancialLiteracy', 'ParentInvolvement', 'Extracurricular', 'Civics_Government', 'ForeignLanguages', 'NutritionEducation', 'Warmth', 'Care_Hunger', 'SocialSciences', 'PerformingArts', 'CharacterEducation', 'TeamSports', 'Other', 'College_CareerPrep', 'Music', 'History_Geography', 'Health_LifeScience', 'EarlyDevelopment', 'ESL', 'Gym_Fitness', 'EnvironmentalScience', 'VisualArts', 'Health_Wellness', 'AppliedSciences', 'SpecialNeeds', 'Literature_Writing', 'Mathematics', 'Literacy']
Shape of matrix after one hot encodig  (110392, 30)
CV : Shape of matrix after one hot encoding  (21850, 30)
test : Shape of matrix after one hot encoding  (21850, 30)


#### 2.2.1.3 school_state, teacher_prefix and project_grade_category

In [16]:
# you can do the similar thing with state, teacher_prefix and project_grade_category also
## school_state
vectorizer = CountVectorizer(vocabulary=list(train_upsampled['school_state'].unique()), lowercase=False, binary=True)
vectorizer.fit(train['school_state'].values)
print(vectorizer.get_feature_names())

school_state_one_hot = vectorizer.transform(train_upsampled['school_state'].values)
print("Shape of matrix after one hot encodig ",school_state_one_hot.shape)

# cross-validation data
cv_school_state_one_hot = vectorizer.transform(cv['school_state'].values)
print("CV: Shape of matrix after one hot encoding ",cv_school_state_one_hot.shape)

# test data
test_school_state_one_hot = vectorizer.transform(test['school_state'].values)
print("test: Shape of matrix after one hot encoding ",test_school_state_one_hot.shape)

## teacher_prefix
# drop the missing (NaN) prefix so that only real titles form the vocabulary
tl = [p for p in train_upsampled['teacher_prefix'].unique() if pd.notna(p)]
vectorizer = CountVectorizer(vocabulary=tl, lowercase=False, binary=True)
vectorizer.fit(train_upsampled['teacher_prefix'].values.astype(str))
print("\n" + str(vectorizer.get_feature_names()))

teacher_prefix_one_hot = vectorizer.transform(train_upsampled['teacher_prefix'].values.astype(str))
print("Shape of matrix after one hot encodig ",teacher_prefix_one_hot.shape)

# cross-validation data
cv_teacher_prefix_one_hot = vectorizer.transform(cv['teacher_prefix'].values.astype(str))
print("CV: Shape of matrix after one hot encoding ",cv_teacher_prefix_one_hot.shape)

# test data
test_teacher_prefix_one_hot = vectorizer.transform(test['teacher_prefix'].values.astype(str))
print("test: Shape of matrix after one hot encoding ",test_teacher_prefix_one_hot.shape)


## project_grade_category
vectorizer = CountVectorizer(vocabulary=list(train_upsampled['project_grade_category'].unique()), lowercase=False, binary=True)
vectorizer.fit(train_upsampled['project_grade_category'].values)
print("\n" + str(vectorizer.get_feature_names()))

project_grade_category_one_hot = vectorizer.transform(train_upsampled['project_grade_category'].values)
print("Shape of matrix after one hot encodig ",project_grade_category_one_hot.shape)

# cross-validation data
cv_project_grade_category_one_hot = vectorizer.transform(cv['project_grade_category'].values)
print("CV: Shape of matrix after one hot encoding ",cv_project_grade_category_one_hot.shape)

# test data
test_project_grade_category_one_hot = vectorizer.transform(test['project_grade_category'].values)
print("test: Shape of matrix after one hot encoding ",test_project_grade_category_one_hot.shape)



['CA', 'UT', 'GA', 'WA', 'HI', 'IL', 'OH', 'KY', 'SC', 'MO', 'MI', 'NY', 'VA', 'MD', 'TX', 'MS', 'AZ', 'OK', 'PA', 'WV', 'NC', 'CO', 'DC', 'MA', 'FL', 'ID', 'AL', 'ME', 'TN', 'IN', 'NJ', 'LA', 'CT', 'AR', 'OR', 'WI', 'IA', 'AK', 'MN', 'NM', 'MT', 'KS', 'NV', 'RI', 'WY', 'SD', 'NH', 'NE', 'DE', 'ND', 'VT']
Shape of matrix after one hot encodig  (110392, 51)
CV: Shape of matrix after one hot encoding  (21850, 51)
test: Shape of matrix after one hot encoding  (21850, 51)

['Mrs.', 'Ms.', 'Mr.', 'Teacher', 'Dr.']
Shape of matrix after one hot encodig  (110392, 5)
CV: Shape of matrix after one hot encoding  (21850, 5)
test: Shape of matrix after one hot encoding  (21850, 5)

['Grades PreK-2', 'Grades 3-5', 'Grades 9-12', 'Grades 6-8']
Shape of matrix after one hot encodig  (110392, 4)
CV: Shape of matrix after one hot encoding  (21850, 4)
test: Shape of matrix after one hot encoding  (21850, 4)


### 2.2.2 Encoding Numerical Features

#### 2.2.2.1 teacher_number_of_previously_posted_projects

In [17]:
from sklearn.preprocessing import StandardScaler

# standardizing the attribute 'teacher_number_of_previously_posted_projects'
teacher_prev_proj_scalar = StandardScaler()
teacher_prev_proj_scalar.fit(train_upsampled['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {teacher_prev_proj_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_prev_proj_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
teacher_prev_proj_standardized = teacher_prev_proj_scalar.transform(train_upsampled['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))

teacher_prev_proj_wo_std = train_upsampled['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)

print("Shape of teacher_previous_projects:", teacher_prev_proj_wo_std.shape)

Mean : 7.980206899050656, Standard deviation : 21.450536452403668
Shape of teacher_previous_projects: (110392, 1)


In [18]:
## Cross Validation
# standardizing the attribute 'teacher_number_of_previously_posted_projects'
teacher_prev_proj_scalar = StandardScaler()
teacher_prev_proj_scalar.fit(cv['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {teacher_prev_proj_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_prev_proj_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
cv_teacher_prev_proj_standardized = teacher_prev_proj_scalar.transform(cv['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))

cv_teacher_prev_proj_wo_std = cv['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)
print("Shape of teacher_previous_projects:", cv_teacher_prev_proj_wo_std.shape)

Mean : 13.671762013729976, Standard deviation : 31.88837788699401
Shape of teacher_previous_projects: (21850, 1)


In [19]:
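## Test Data
# standardizing the attribute 'teacher_number_of_previously_posted_projects'
# NOTE: this refits the scaler on the test split, so the test values are scaled with test statistics.
# Reusing the scaler fitted on train_upsampled (transform only) would keep the scaling consistent with training.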
teacher_prev_proj_scalar = StandardScaler()
teacher_prev_proj_scalar.fit(test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {teacher_prev_proj_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_prev_proj_scalar.var_[0])}")

# Now standardize the data with above mean and variance.
test_teacher_prev_proj_standardized = teacher_prev_proj_scalar.transform(test['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))

test_teacher_prev_proj_wo_std = test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)
print("Shape of teacher_previous_projects:", test_teacher_prev_proj_wo_std.shape)

Mean : 14.287826086956521, Standard deviation : 32.4711136276059
Shape of teacher_previous_projects: (21850, 1)


#### 2.2.2.2 price

In [20]:
price_scalar = StandardScaler()
price_scalar.fit(train_upsampled['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with the above mean and variance.
price_standardized = price_scalar.transform(train_upsampled['price'].values.reshape(-1, 1))

price_wo_std = train_upsampled['price'].values.reshape(-1,1)
print("Shape of price:", price_wo_std.shape)

Mean : 337.17947940068115, Standard deviation : 377.81493126721455
Shape of price: (110392, 1)


In [21]:
price_scalar = StandardScaler()
price_scalar.fit(cv['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

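# Now standardize the data with the above mean and variance.
# NOTE: the scaler was refit on the CV split here; reusing the one fitted on train_upsampled would keep the scaling consistent with training.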
cv_price_standardized = price_scalar.transform(cv['price'].values.reshape(-1, 1))

cv_price_wo_std = cv['price'].values.reshape(-1,1)
print("Shape of price:", cv_price_wo_std.shape)

Mean : 289.61196292906175, Standard deviation : 383.7327163345801
Shape of price: (21850, 1)


In [22]:
price_scalar = StandardScaler()
price_scalar.fit(test['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

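# Now standardize the data with the above mean and variance.
# NOTE: the scaler was refit on the test split here; reusing the one fitted on train_upsampled would keep the scaling consistent with training.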
test_price_standardized = price_scalar.transform(test['price'].values.reshape(-1, 1))

test_price_wo_std = test['price'].values.reshape(-1,1)
print("Shape of price:", test_price_wo_std.shape)

Mean : 270.78274462242564, Standard deviation : 351.373759381503
Shape of price: (21850, 1)
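
The cells above fit a separate `StandardScaler` on each split. A common alternative is to learn the mean and standard deviation from the training split only and reuse them for the CV and test splits. The sketch below illustrates that pattern on toy arrays; the array names and values are assumptions for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the 'price' column of the train, CV and test splits
train_price = np.array([100.0, 200.0, 300.0, 400.0]).reshape(-1, 1)
cv_price = np.array([150.0, 250.0]).reshape(-1, 1)
test_price = np.array([50.0, 500.0]).reshape(-1, 1)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler()
scaler.fit(train_price)

# Reuse the same statistics for every split
train_std = scaler.transform(train_price)
cv_std = scaler.transform(cv_price)
test_std = scaler.transform(test_price)

print("Train mean used for all splits:", scaler.mean_[0])
print("CV values on the training scale:", cv_std.ravel())
```

With this arrangement a given raw value maps to the same standardized value in every split, which is usually what a model trained on the standardized training data expects.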


## 2.3 Make Data Model Ready: encoding essay and project_title

### 2.3.1 Encoding Text Features

#### 2.3.1.1 Preprocessing essays and project_title

In [23]:
## Preprocess essays and project_titles for train, cv and test_data

# Combining all the above cleaning steps
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(train_upsampled['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())

cv_preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(cv['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    cv_preprocessed_essays.append(sent.lower().strip())
    
test_preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(test['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    test_preprocessed_essays.append(sent.lower().strip())
    
# similarly you can preprocess the titles also
preprocessed_titles = []
# tqdm is for printing the status bar
for sentance in tqdm(train_upsampled['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_titles.append(sent.lower().strip())
    
cv_preprocessed_titles = []
# tqdm is for printing the status bar
for sentance in tqdm(cv['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    cv_preprocessed_titles.append(sent.lower().strip())
    
test_preprocessed_titles = []
# tqdm is for printing the status bar
for sentance in tqdm(test['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    test_preprocessed_titles.append(sent.lower().strip())

100%|██████████| 110392/110392 [01:07<00:00, 1638.28it/s]
100%|██████████| 21850/21850 [00:13<00:00, 1594.87it/s]
100%|██████████| 21850/21850 [00:13<00:00, 1605.46it/s]
100%|██████████| 110392/110392 [00:03<00:00, 34957.81it/s]
100%|██████████| 21850/21850 [00:00<00:00, 34639.01it/s]
100%|██████████| 21850/21850 [00:00<00:00, 34392.06it/s]
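
The six loops above apply identical cleaning steps to different columns and splits. One way to avoid the repetition is a single helper function, sketched below. The notebook defines its own `decontracted` and `stopwords` earlier; the reduced versions here, and the name `preprocess_text`, are hypothetical stand-ins so that the example runs on its own.

```python
import re

# Hypothetical stand-ins: the notebook defines its own `decontracted` and `stopwords`
STOPWORDS = {"the", "a", "is", "to", "and", "my", "are"}

def decontracted(phrase):
    """Expand a few common English contractions (reduced illustrative version)."""
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    return phrase

def preprocess_text(texts, stopwords=STOPWORDS):
    """Apply the same cleaning steps as the loops above to an iterable of strings."""
    cleaned = []
    for text in texts:
        sent = decontracted(text)
        # Remove literal escape sequences left in the raw text
        sent = sent.replace('\\r', ' ').replace('\\"', ' ').replace('\\n', ' ')
        # Keep only letters and digits
        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
        # Drop stopwords, then lowercase
        sent = ' '.join(w for w in sent.split() if w.lower() not in stopwords)
        cleaned.append(sent.lower().strip())
    return cleaned

sample = ["My students can't wait to read!\\r\\nThey are eager."]
print(preprocess_text(sample))
```

Under these assumptions, each of the six lists could be produced by one call, for example `preprocess_text(cv['essay'].values)`.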


#### 2.3.1.2 Bag of Words:

##### 2.3.1.2.1 essays

In [35]:
# We are considering only the words which appeared in at least 10 documents(rows or projects).
vectorizer = CountVectorizer(min_df=10)
text_bow = vectorizer.fit_transform(preprocessed_essays)
print("Shape of matrix after one hot encoding ",text_bow.shape)

cv_text_bow = vectorizer.transform(cv_preprocessed_essays)
print("CV: Shape of matrix after one hot encoding ",cv_text_bow.shape)

test_text_bow = vectorizer.transform(test_preprocessed_essays)
print("test: Shape of matrix after one hot encoding ",test_text_bow.shape)


Shape of matrix after one hot encoding  (110392, 16749)
CV: Shape of matrix after one hot encoding  (21850, 16749)
test: Shape of matrix after one hot encoding  (21850, 16749)


In [38]:
bow_train_features = vectorizer.get_feature_names()
print(len(bow_train_features))

16749
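
To make the effect of `min_df` and of fitting on the training split only more concrete, here is a small sketch on toy documents. The documents and the threshold `min_df=2` are illustrative assumptions; the notebook itself uses `min_df=10` on the essays.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "students need books",
    "students need tablets",
    "books for reading",
]
cv_docs = ["students need chairs"]

# Keep only words that occur in at least 2 training documents
vectorizer = CountVectorizer(min_df=2)
train_bow = vectorizer.fit_transform(train_docs)  # learns the vocabulary
cv_bow = vectorizer.transform(cv_docs)            # reuses it; unseen words are ignored

print(sorted(vectorizer.vocabulary_))
print(train_bow.shape, cv_bow.shape)
```

Because the vocabulary is learned once, every split ends up with the same number of columns, which matches the identical second dimension reported for the train, CV and test matrices above.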


##### 2.3.1.2.2 project_title

In [39]:
# project_title
# before you vectorize the title make sure you preprocess it
vectorizer = CountVectorizer()
title_bow = vectorizer.fit_transform(preprocessed_titles)
print("Shape of matrix after one hot encoding ",title_bow.shape)

cv_title_bow = vectorizer.transform(cv_preprocessed_titles)
print("Shape of matrix after one hot encoding ",cv_title_bow.shape)

test_title_bow = vectorizer.transform(test_preprocessed_titles)
print("Shape of matrix after one hot encoding ",test_title_bow.shape)

Shape of matrix after one hot encoding  (110392, 13019)
Shape of matrix after one hot encoding  (21850, 13019)
Shape of matrix after one hot encoding  (21850, 13019)


In [40]:
bow_train_features = bow_train_features + vectorizer.get_feature_names()
print(len(bow_train_features))

29768


In [0]:
with open('/content/drive/My Drive/Assignment2/bow_train_features.pkl', 'wb') as f:
  pickle.dump(bow_train_features, f)

In [31]:
# concatenate title and essay text BOW 
'''from scipy.sparse import hstack

text_features_bow = hstack((title_bow, text_bow))

print("Text Features Dimensions:", text_features_bow.shape)'''

Text Features Dimensions: (110392, 29768)


In [0]:
## Save the BOW matrices

#from scipy import sparse

#sparse.save_npz("/content/drive/My Drive/Assignment2/text_features_bow.npz", text_features_bow)
#sparse.save_npz("/content/drive/My Drive/Assignment2/cv_text_bow.npz", cv_text_bow)
#sparse.save_npz("/content/drive/My Drive/Assignment2/test_text_bow.npz", test_text_bow)
#sparse.save_npz("/content/drive/My Drive/Assignment2/title_bow.npz", title_bow)
#sparse.save_npz("/content/drive/My Drive/Assignment2/cv_title_bow.npz", cv_title_bow)
#sparse.save_npz("/content/drive/My Drive/Assignment2/test_title_bow.npz", test_title_bow)
#your_matrix_back = sparse.load_npz("yourmatrix.npz")
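
A minimal sketch of the save and load round trip that the commented-out lines above describe, using `scipy.sparse.save_npz` and `scipy.sparse.load_npz`. The toy matrix and the temporary directory are assumptions for illustration; the notebook writes to a Google Drive path instead.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small CSR matrix standing in for a bag-of-words matrix such as title_bow
matrix = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 3]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    # One file per matrix, so that no save overwrites another
    path = os.path.join(tmp_dir, "title_bow.npz")
    sparse.save_npz(path, matrix)
    restored = sparse.load_npz(path)

# No entry differs between the original and the restored matrix
print((matrix != restored).nnz == 0)
```

Giving each matrix its own file name matters here: saving two different matrices under the same name silently keeps only the last one.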


#### 2.3.1.3 TF_IDF Vectorizer

##### 2.3.1.3.1 essays

In [41]:
## Essays
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10)
text_tfidf = vectorizer.fit_transform(preprocessed_essays)
print("Shape of matrix after one hot encoding ",text_tfidf.shape)

cv_text_tfidf = vectorizer.transform(cv_preprocessed_essays)
print("CV: Shape of matrix after one hot encoding ",cv_text_tfidf.shape)

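test_text_tfidf = vectorizer.transform(test_preprocessed_essays)
print("test: Shape of matrix after one hot encoding ",test_text_tfidf.shape)

# keep the essay vocabulary so that the title vocabulary can be appended to it later
tfidf_train_features = vectorizer.get_feature_names()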

Shape of matrix after one hot encoding  (110392, 16749)
CV: Shape of matrix after one hot encoding  (21850, 16749)
test: Shape of matrix after one hot encoding  (21850, 16749)


##### 2.3.1.3.2 project_title

In [43]:
## project_title
vectorizer = TfidfVectorizer()
title_tfidf = vectorizer.fit_transform(preprocessed_titles)
print("Shape of matrix after one hot encoding ",title_tfidf.shape)

cv_title_tfidf = vectorizer.transform(cv_preprocessed_titles)
print("CV: Shape of matrix after one hot encoding ",cv_title_tfidf.shape)

test_title_tfidf = vectorizer.transform(test_preprocessed_titles)
print("test: Shape of matrix after one hot encoding ",test_title_tfidf.shape)

Shape of matrix after one hot encoding  (110392, 13019)
CV: Shape of matrix after one hot encoding  (21850, 13019)
test: Shape of matrix after one hot encoding  (21850, 13019)
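
For readers who want to see what `TfidfVectorizer` computes, the sketch below checks two of its documented defaults on toy documents: the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and the L2 normalisation of each row. The documents are illustrative assumptions, not data from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["books books pencils", "books tablets"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# With the default smooth_idf=True: idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_docs = len(docs)
df_books = 2  # 'books' appears in both documents
expected_idf_books = np.log((1 + n_docs) / (1 + df_books)) + 1

idx = vectorizer.vocabulary_["books"]
print(vectorizer.idf_[idx], expected_idf_books)

# Each row is L2-normalised by default (norm='l2')
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

All resulting values are non-negative, so TF-IDF matrices of this kind can be passed to Multinomial Naive Bayes without further shifting.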


In [0]:
tfidf_train_features = tfidf_train_features + vectorizer.get_feature_names()

In [0]:
with open('/content/drive/My Drive/Assignment2/tfidf_train_features.pkl', 'wb') as f:
  pickle.dump(tfidf_train_features, f)

In [45]:
print(len(tfidf_train_features))

29768


In [33]:
# concatenate title and essay text TFIDF
'''
text_features_tfidf = hstack((title_tfidf, text_tfidf))

print("Text Features Dimensions:", text_features_tfidf.shape)

sparse.save_npz("/content/drive/My Drive/Assignment2/text_features_tfidf.npz", text_features_tfidf)
'''

Text Features Dimensions: (110392, 29768)


In [0]:
'''
sparse.save_npz("/content/drive/My Drive/Assignment2/text_tfidf.npz", text_tfidf)
sparse.save_npz("/content/drive/My Drive/Assignment2/cv_text_tfidf.npz", cv_text_tfidf)
sparse.save_npz("/content/drive/My Drive/Assignment2/test_text_tfidf.npz", test_text_tfidf)
sparse.save_npz("/content/drive/My Drive/Assignment2/title_tfidf.npz", title_tfidf)
sparse.save_npz("/content/drive/My Drive/Assignment2/cv_title_tfidf.npz", cv_title_tfidf)
sparse.save_npz("/content/drive/My Drive/Assignment2/test_title_tfidf.npz", test_title_tfidf)
#your_matrix_back = sparse.load_npz("yourmatrix.npz")
'''

## 2.4 Applying Naive Bayes on different kinds of featurization as mentioned in the instructions

<br>Apply Multinomial Naive Bayes on the different kinds of featurization mentioned in the instructions.
<br>For every model you work on, make sure you complete step 2 and step 3 of the instructions.
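
As a rough illustration of how Multinomial Naive Bayes might be fitted on matrices like the ones built below, the sketch here chooses the smoothing parameter `alpha` by AUC on a held-out split. The random toy data, the candidate `alpha` values and the choice of AUC as the metric are assumptions for illustration and are not taken from the instructions or from this notebook's results.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)

# Toy non-negative count features; MultinomialNB rejects negative values
X_train = csr_matrix(rng.poisson(1.0, size=(200, 20)))
y_train = rng.randint(0, 2, size=200)
X_cv = csr_matrix(rng.poisson(1.0, size=(80, 20)))
y_cv = rng.randint(0, 2, size=80)

best_alpha, best_auc = None, -1.0
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train, y_train)
    # Probability of the positive class for each CV row
    cv_scores = model.predict_proba(X_cv)[:, 1]
    auc = roc_auc_score(y_cv, cv_scores)
    if auc > best_auc:
        best_alpha, best_auc = alpha, auc

print("Best alpha:", best_alpha, "CV AUC:", round(best_auc, 3))
```

The requirement for non-negative inputs is one reason to feed the unstandardized numerical columns (`price_wo_std`, `teacher_prev_proj_wo_std`) into the data matrix, since standardized values can be negative.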

### 2.4.1 Applying Multinomial Naive Bayes on BOW,<font color='red'> SET 1</font>

#### 2.4.1.1 Form the Data Matrix, Standardize

In [0]:
# Please write all the code with proper documentation
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
# with the same hstack function we are concatenating a sparse matrix and a dense matrix :)

# train data
X_train_up = hstack((categories_one_hot, sub_categories_one_hot, school_state_one_hot, teacher_prefix_one_hot, price_wo_std, teacher_prev_proj_wo_std, title_bow, text_bow))
Y_train_up = train_upsampled['project_is_approved']
print("Data Matrix Dimensions:", X_train_up.shape)
print("Target Variable Dimensions:", Y_train_up.shape)

# cross-validation data
X_cv_up = hstack((cv_categories_one_hot, cv_sub_categories_one_hot, cv_school_state_one_hot, cv_teacher_prefix_one_hot, cv_price_wo_std, cv_teacher_prev_proj_wo_std, cv_title_bow, cv_text_bow))
Y_cv_up = cv['project_is_approved']
print("CV Data Matrix Dimensions:", X_cv_up.shape)
print("CV Target Variable Dimensions:", Y_cv_up.shape)

# test_data
X_test_up = hstack((test_categories_one_hot, test_sub_categories_one_hot, test_school_state_one_hot, test_teacher_prefix_one_hot, test_price_wo_std, test_teacher_prev_proj_wo_std, test_title_bow, test_text_bow))
Y_test_up = test['project_is_approved']
print("Test Data Matrix Dimensions:", X_test_up.shape)
print("Test Target Variable Dimensions:", Y_test_up.shape)


Data Matrix Dimensions: (110392, 29865)
Target Variable Dimensions: (110392,)
CV Data Matrix Dimensions: (21850, 29865)
CV Target Variable Dimensions: (21850,)
Test Data Matrix Dimensions: (21850, 29865)
Test Target Variable Dimensions: (21850,)
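
The stacking step above can be tried on a small scale as follows. The block names and values are toy assumptions; the point is only that `scipy.sparse.hstack` accepts sparse blocks and a dense column with the same number of rows, and that the column counts add up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

one_hot = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))       # sparse categorical block
price = np.array([120.0, 35.5, 980.0]).reshape(-1, 1)          # dense numeric column
bow = csr_matrix(np.array([[0, 2, 1], [1, 0, 0], [0, 0, 3]]))  # sparse text block

# hstack returns a COO matrix by default; convert to CSR for row slicing and model fitting
X = hstack((one_hot, price, bow)).tocsr()

print(X.shape)       # 2 + 1 + 3 columns
print(X.min() >= 0)  # non-negative, as MultinomialNB requires
```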


In [0]:
## Standardize the Data Matrices
'''
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the train data with above mean and variance.
X_train_std = scaler.transform(X_train)

scaler.fit(X_cv) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the data with above mean and variance.
X_cv_std = scaler.transform(X_cv)

scaler.fit(X_test) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the data with above mean and variance.
X_test_std = scaler.transform(X_test)'''

In [0]:
## Storing Data Matrix X values

with open('/content/drive/My Drive/Assignment2/X_train_up_bow', 'wb') as f:
  pickle.dump(X_train_up, f)

with open('/content/drive/My Drive/Assignment2/X_cv_up_bow', 'wb') as f:
  pickle.dump(X_cv_up, f)
  
with open('/content/drive/My Drive/Assignment2/X_test_up_bow', 'wb') as f:
  pickle.dump(X_test_up, f)
  

## Storing Target Variable Y values: 

with open('/content/drive/My Drive/Assignment2/Y_train_up', 'wb') as f:
  pickle.dump(Y_train_up, f)
  

with open('/content/drive/My Drive/Assignment2/Y_cv_up', 'wb') as f:
  pickle.dump(Y_cv_up, f)
  
with open('/content/drive/My Drive/Assignment2/Y_test_up', 'wb') as f:
  pickle.dump(Y_test_up, f)

### 2.4.2 Applying Multinomial Naive Bayes on TFIDF,<font color='red'> SET 2</font>

#### 2.4.2.1 Form the Data Matrix

In [0]:
# Please write all the code with proper documentation
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
# with the same hstack function we are concatenating a sparse matrix and a dense matrix :)

# train data
X_train_up = hstack((categories_one_hot, sub_categories_one_hot, school_state_one_hot, teacher_prefix_one_hot, price_wo_std, teacher_prev_proj_wo_std, title_tfidf, text_tfidf))
Y_train_up = train_upsampled['project_is_approved']
print("Data Matrix Dimensions:", X_train_up.shape)
print("Target Variable Dimensions:", Y_train_up.shape)

# cross-validation data
X_cv_up = hstack((cv_categories_one_hot, cv_sub_categories_one_hot, cv_school_state_one_hot, cv_teacher_prefix_one_hot, cv_price_wo_std, cv_teacher_prev_proj_wo_std, cv_title_tfidf, cv_text_tfidf))
Y_cv_up = cv['project_is_approved']
print("CV Data Matrix Dimensions:", X_cv_up.shape)
print("CV Target Variable Dimensions:", Y_cv_up.shape)

# test_data
X_test_up = hstack((test_categories_one_hot, test_sub_categories_one_hot, test_school_state_one_hot, test_teacher_prefix_one_hot, test_price_wo_std, test_teacher_prev_proj_wo_std, test_title_tfidf, test_text_tfidf))
Y_test_up = test['project_is_approved']
print("Test Data Matrix Dimensions:", X_test_up.shape)
print("Test Target Variable Dimensions:", Y_test_up.shape)



Data Matrix Dimensions: (110392, 29865)
Target Variable Dimensions: (110392,)
CV Data Matrix Dimensions: (21850, 29865)
CV Target Variable Dimensions: (21850,)
Test Data Matrix Dimensions: (21850, 29865)
Test Target Variable Dimensions: (21850,)


In [0]:
## Store the Data Matrix without Standardization

## Storing Data Matrix X values

with open('/content/drive/My Drive/Assignment2/X_train_up_tfidf', 'wb') as f:
  pickle.dump(X_train_up, f)


with open('/content/drive/My Drive/Assignment2/X_cv_up_tfidf', 'wb') as f:
  pickle.dump(X_cv_up, f)
  
with open('/content/drive/My Drive/Assignment2/X_test_up_tfidf', 'wb') as f:
  pickle.dump(X_test_up, f)
  

## Storing Target Variable Y values: 
'''
with open('/content/drive/My Drive/Assignment2/Y_train_tfidf', 'wb') as f:
  pickle.dump(Y_train, f)
  

with open('/content/drive/My Drive/Assignment2/Y_cv_tfidf', 'wb') as f:
  pickle.dump(Y_cv, f)
  
with open('/content/drive/My Drive/Assignment2/Y_test_tfidf', 'wb') as f:
  pickle.dump(Y_test, f)
'''


In [0]:
## Standardize the Data Matrices
'''
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the train data with above mean and variance.
X_train_std = scaler.transform(X_train)

scaler.fit(X_cv) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the data with above mean and variance.
X_cv_std = scaler.transform(X_cv)

scaler.fit(X_test) # finding the mean and standard deviation of this data
print(f"Mean : {scaler.mean_[0]}, Standard deviation : {np.sqrt(scaler.var_[0])}")
# Now standardize the data with above mean and variance.
X_test_std = scaler.transform(X_test)'''