# Genre classification with `sklearn`

1\. **Upload the file "book_reviews.csv" from your machine, following the [Colab documentation](https://colab.research.google.com/notebooks/io.ipynb). This file contains 10,000 English-language book reviews from Goodreads, with genre, age and star rating labels. Uploading may take a minute or so.**

2\. **Load the .csv file into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv). This makes it easy to access and filter data.**
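As a minimal sketch of this step: the snippet below reads CSV data with `pd.read_csv` and filters it with boolean indexing. So that it runs anywhere, it reads from a small in-memory stand-in with made-up rows; in the notebook you would pass the path `"book_reviews.csv"` instead. The column names `tokenised_text` and `rating_no` are taken from step 3, and the variable names are only illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded file (made-up rows).
# In the notebook, use: reviews = pd.read_csv("book_reviews.csv")
csv_file = io.StringIO(
    "tokenised_text,rating_no\n"
    "a gripping and clever mystery,5\n"
    "slow plot and flat characters,2\n"
    "pleasant but forgettable romance,3\n"
)

reviews = pd.read_csv(csv_file)

# Inspect the size and the first rows of the dataframe
print(reviews.shape)
print(reviews.head())

# Boolean indexing keeps only the rows that satisfy a condition
high_rated = reviews[reviews["rating_no"] >= 4]
print(high_rated)
```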

3\. **Now you can construct the document-term matrix. The [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class counts how often each word occurs in each document. Optionally, you can also pass `ngram_range` as a parameter, to see if combinations of multiple words are better predictors for ratings. Define the output of the `fit_transform` function on `'tokenised_text'` as your feature matrix `X`, and the star ratings (`'rating_no'`) as the variable `y` you're trying to predict.**

To inspect the words in the document-term matrix, you can call `get_feature_names_out()` on the vectorizer.

Alternatively, you could also use a [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer): this class counts how often a word occurs in a document and weighs it against the number of documents in the corpus that contain the word. This down-weights words that are frequent but not very meaningful. You can play around with different vectorizers to see how they affect your results.

4\. **Now we can define a baseline model: use the `DummyClassifier` to always predict the most frequent genre in the dataset.**
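A minimal sketch of such a baseline, using made-up labels: with `strategy="most_frequent"`, the `DummyClassifier` ignores the features entirely and always predicts the most common label it saw during fitting. The arrays `X_toy` and `y_toy` are hypothetical placeholders for your own `X` and `y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the feature values do not matter for the baseline
X_toy = np.zeros((6, 1))
y_toy = ["fantasy", "fantasy", "fantasy", "romance", "romance", "thriller"]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Every prediction is the majority label
print(baseline.predict(X_toy))

# Accuracy equals the share of the majority label (3 out of 6 here)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any real classifier should score higher than this baseline to be worth using.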

5\. **After defining your document-term matrix, you can split the data into training and test sets. Set `random_state` so that the split is the same for everyone in the group; otherwise, different random selections would cause slightly different results.**
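The split can be sketched with `train_test_split`. The toy arrays, the 80/20 ratio and the seed `42` are assumptions for illustration; use your own `X` and `y` and whatever seed your group agrees on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; fixing random_state
# makes the shuffle, and therefore the split, reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```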

6\. **Now pick one of the following classifiers:**
- [K-Nearest Neighbor classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
- [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
- [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
- [Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
- [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
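All of these classifiers share the same `fit` / `predict` interface, so swapping one for another only changes a single line. The sketch below uses Multinomial Naive Bayes on made-up toy documents with hypothetical genre labels; it is an illustration of the workflow, not a recommended model choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the training and test split
train_docs = [
    "dragon wizard magic quest",
    "magic sword dragon kingdom",
    "love kiss wedding heart",
    "heart love letter kiss",
]
train_labels = ["fantasy", "fantasy", "romance", "romance"]
test_docs = ["dragon magic", "love heart"]
test_labels = ["fantasy", "romance"]

# Learn the vocabulary on the training documents only,
# then reuse it (transform, not fit_transform) for the test documents
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

predictions = classifier.predict(X_test)
print(predictions)
print(accuracy_score(test_labels, predictions))
```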

7\. **Find the parameters which lead to the best results. You can also automate this with [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), as shown below.**
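A minimal sketch of a grid search, assuming Multinomial Naive Bayes as the classifier: `GridSearchCV` tries every value in `param_grid` with cross-validation and keeps the best one. The toy documents, the candidate values for `alpha` and the choice of `cv=2` are illustrative assumptions; on the real data you would typically use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: four documents per genre
docs = [
    "dragon wizard magic quest", "magic sword dragon kingdom",
    "wizard casts magic spell", "dragon guards the kingdom",
    "love kiss wedding heart", "heart love letter kiss",
    "wedding dress and love", "a kiss and a broken heart",
]
labels = ["fantasy"] * 4 + ["romance"] * 4

X = CountVectorizer().fit_transform(docs)

# Candidate values for the smoothing parameter of Naive Bayes
param_grid = {"alpha": [0.1, 1.0, 10.0]}

# Evaluate each candidate with 2-fold cross-validation
search = GridSearchCV(MultinomialNB(), param_grid, cv=2, scoring="accuracy")
search.fit(X, labels)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` is a classifier refitted on all the data passed to `fit` with the best parameters.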

8\. **Try combining multiple classifiers, for instance with a [Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier). Can you get a better result?**
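One way to combine classifiers is hard voting, where each model casts one vote and the majority label wins. The sketch below combines three of the classifiers from step 6 on made-up toy documents; the particular combination and the hyperparameters are assumptions, not a tuned setup.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: four documents per genre
docs = [
    "dragon wizard magic quest", "magic sword dragon kingdom",
    "wizard casts magic spell", "dragon guards the kingdom",
    "love kiss wedding heart", "heart love letter kiss",
    "wedding dress and love", "a kiss and a broken heart",
]
labels = ["fantasy"] * 4 + ["romance"] * 4

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Each (name, estimator) pair contributes one vote per prediction
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, labels)

new_docs = ["magic dragon quest", "love and a wedding"]
predictions = ensemble.predict(vectorizer.transform(new_docs))
print(predictions)
```

To judge whether the ensemble helps, compare its accuracy on the test set with that of each individual classifier.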