# Can we predict the category of an app from the app's description?

We wanted to see whether we could build a machine learning model that accurately predicts the category of an app from its description. To do this, we built a web scraper to collect data from the Google Play Store. After cleaning the data, we preprocessed the text with NLTK, then performed feature engineering and model fitting to find the best-performing model.

<br>

<br>

## Web Scraping

We created a web scraper to collect 60 app descriptions per category for 18 categories. After collecting the data, we saved it to a file as a dictionary keyed by category.
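
As a rough illustration of the save-and-reload step, the sketch below writes a small dictionary to a `.npy` file and reads it back. The toy `scraped` dictionary and the temporary path are assumptions made for the example; the real scraper output is not reproduced here.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the scraped data: category -> list of descriptions.
scraped = {
    "EDUCATION": ["Learn new words every day.", "Practice math with quizzes."],
    "WEATHER": ["Hourly forecasts and radar maps."],
}

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "my_file.npy")

# np.save wraps the dict in a 0-d object array and pickles it.
np.save(path, scraped)

# Loading pickled objects requires allow_pickle=True on NumPy 1.16.3 and later;
# .item() unwraps the 0-d array back into a plain dict.
loaded = np.load(path, allow_pickle=True).item()

print(loaded.keys())
```

Because the file is a pickle under the hood, it should only be loaded from a trusted source. A JSON file would be a plainer alternative for a dictionary of strings.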

In [12]:
import re
import pandas as pd
import numpy as np
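# The .npy file holds a pickled dictionary. NumPy 1.16.3+ needs
# allow_pickle=True to load it, and .item() unwraps the 0-d array into a dict.
read_dictionary = np.load('my_file.npy', allow_pickle=True).item()

edu = read_dictionary

print(edu.keys())
edu['EDUCATION']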

dict_keys(['EDUCATION', 'BUSINESS', 'DATING', 'SPORTS', 'WEATHER', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'BEAUTY', 'MUSIC_AND_AUDIO', 'NEWS_AND_MAGAZINES', 'SOCIAL', 'SHOPPING', 'PRODUCTIVITY', 'PHOTOGRAPHY', 'MEDICAL', 'PARENTING', 'COMMUNICATION', 'TOOLS'])


["⭐Es una App genial porque aprendes palabras nuevas y practicas la pronunciación a cada momento que desbloqueas tu pantalla.⭐Aprendes cada vez que abres tu teléfono. La forma más inconsciente de aprender eficazmente.¿Cuántas veces miras tu móvil al día?Lo miras un promedio de 100 veces al día y lo desbloqueas al menos 50 veces.¡Si estudiaras vocabulario en esos momentos en los que miras el móvil, podrías aprender unas 3,000 palabras en tan solo un mes!Wordbit inglés es una App que te permite estudiar inglés desde la pantalla de bloqueo de tu móvil.Excava valiosos tesoros desde tu pantalla bloqueada. Formará parte de tu rutina de aprendizaje.Además, ¡es totalmente gratis! ■ La memorización del vocabulario es la clave para el aprendizaje de cualquier lengua extranjeray la técnica más básica para memorizar el vocabulario es la repetición.Puedes memorizar el vocabulario mirándolo repetidamente y sin darte cuenta.¡Nuestra aplicación no sólo te ayudará a aprender nuevas palabras, sino tambi

<br>

## NLTK: Natural Language Toolkit

In [2]:
import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
from nltk.stem.snowball import SnowballStemmer

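# Matches runs of ASCII letters, optionally followed by an apostrophe suffix (e.g. "don't")
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

# stop words (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))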

# stem words
stemmer = SnowballStemmer("english")
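
To make the tokenization pattern concrete, here is a minimal sketch that applies the same regular expression with the standard-library `re` module instead of NLTK, so it runs without any corpus downloads. The sample sentences are invented for illustration.

```python
import re

# Same pattern as in the notebook: ASCII letters plus an optional apostrophe suffix.
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

# Invented sample text; digits and punctuation are dropped by the pattern.
sample = "Don't miss out: learn 3,000 words in 1 month!"
tokens = [t.lower() for t in re.findall(pattern, sample)]
print(tokens)
# ["don't", 'miss', 'out', 'learn', 'words', 'in', 'month']

# The pattern only covers a-z and A-Z, so accented words are split apart.
print(re.findall(pattern, "pronunciación"))
# ['pronunciaci', 'n']
```

The second call suggests one caveat: non-English descriptions in the scraped data will be fragmented by this pattern, and the English stop word list and stemmer will not handle them well either. Filtering by language before cleaning may be worth considering.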

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

# Uses a regex to tokenize the description, lowercases the tokens,
# removes stop words, stems each word, and joins the stems into one string
def text_cleaner(review):   
    art_tokens_raw = nltk.regexp_tokenize(review, pattern)
    art_tokens = [i.lower() for i in art_tokens_raw]
    art_tokens_stopped = [w for w in art_tokens if w not in stop_words]
    art_stemmed = [stemmer.stem(word) for word in art_tokens_stopped]
    cleaned = ' '.join(art_stemmed)
    return cleaned


# Iterates over the descriptions under each key (category),
# cleans each description, and appends it to a new list
def dict_cleaner(dictionary):
    review_list = []
    for descriptions in dictionary.values():
        for review in descriptions:
            cleaned = text_cleaner(review)
            review_list.append(cleaned)
    return review_list

review_list = dict_cleaner(edu)

response = tfidf.fit_transform(review_list)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())

df.head()

Unnamed: 0,aa,aac,aaptiv,aarp,aask,ab,abajo,abandon,abc,abcmous,...,zoompres,zoomterrain,zoomwithushav,zoosk,zte,zulili,zulu,zulufor,zulupermiss,zumba
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
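
The feature matrix has one row per description, but predicting a category also needs a target label for each row. The sketch below shows one way to collect labels in the same order as the cleaned descriptions. The function name `labeled_cleaner`, its `clean` parameter, and the toy dictionary are hypothetical and not part of the notebook; `str.lower` stands in for the notebook's cleaning function so the example runs without NLTK.

```python
def labeled_cleaner(dictionary, clean=str.lower):
    """Return cleaned descriptions and their categories as two aligned lists.

    `clean` is any function mapping a raw description to a cleaned string;
    it defaults to str.lower here only to keep the sketch dependency-free.
    """
    texts, labels = [], []
    for category, descriptions in dictionary.items():
        for description in descriptions:
            texts.append(clean(description))
            labels.append(category)  # one label per description, same order
    return texts, labels


# Hypothetical toy data in the same shape as the scraped dictionary.
toy = {
    "EDUCATION": ["Learn Words", "Math Quiz"],
    "WEATHER": ["Radar Maps"],
}

texts, labels = labeled_cleaner(toy)
print(texts)   # ['learn words', 'math quiz', 'radar maps']
print(labels)  # ['EDUCATION', 'EDUCATION', 'WEATHER']
```

Since dictionaries preserve insertion order in Python 3.7 and later, the labels line up with the rows produced by iterating over the same dictionary, so they could serve as the target vector during model fitting.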
