# Working with data

_There is a helper function in scikit-learn to load files stored in a folder structure like the one below, where each subfolder corresponds to a label._

<img src="../img/load_files_example.png">

In [9]:
from sklearn.datasets import load_files
import numpy as np

reviews_train = load_files("../data/aclImdb/train/")

# load_files returns a Bunch containing the training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target

print(f"Type of text_train: {type(text_train)}")
print(f"Length of text_train: {len(text_train)}")
print(f"FIRST ENTRY \n\n {text_train[0]}")

Type of text_train: <class 'list'>
Length of text_train: 75000
FIRST ENTRY 

 b'Full of (then) unknown actors TSF is a great big cuddly romp of a film.<br /><br />The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.<br /><br />The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.<br /><br />And for anyone who lived in Glasgow it\'s a great "Oh I know where that is" film.'


The entry contains HTML markup (the `<br />` line breaks). It is usually better to remove this markup before going any further.

In [10]:
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

# check the number of samples per label
print(f"Samples per class (training): {np.bincount(y_train)}")

# loading the test set using the same approach as the one above

reviews_test = load_files("../data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target

# len(reviews_test) would count the fields of the Bunch, not the documents
print(f"Number of documents in test data: {len(text_test)}")
print(f"Samples per class (Test): {np.bincount(y_test)} ")

text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Samples per class (training): [12500 12500 50000]
Number of documents in test data: 25000
Samples per class (Test): [12500 12500] 
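
The training counts above list three classes, although the reviews carry only two labels. A likely explanation, assuming the standard aclImdb layout, is that `train/` also contains an `unsup` folder of unlabeled reviews, which `load_files` treats as a third label. The sketch below rebuilds that situation on a tiny temporary folder tree (the folder names and file contents are made up for illustration) and shows how the `categories` argument of `load_files` restricts loading to the labeled subfolders.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_files

# Build a tiny, hypothetical folder tree: one subfolder per label,
# plus an "unsup" subfolder that holds unlabeled documents.
with tempfile.TemporaryDirectory() as root:
    for label, n_docs in [("neg", 2), ("pos", 2), ("unsup", 3)]:
        folder = Path(root) / label
        folder.mkdir()
        for i in range(n_docs):
            (folder / f"{i}.txt").write_text(f"{label} review number {i}")

    # Without `categories`, every subfolder becomes a class
    all_docs = load_files(root)
    # With `categories`, only the listed subfolders are loaded
    labeled = load_files(root, categories=["neg", "pos"])

print(f"All folders:  {all_docs.target_names} {np.bincount(all_docs.target)}")
print(f"Labeled only: {labeled.target_names} {np.bincount(labeled.target)}")
```

Applied to the real data, the same argument would be passed as `load_files("../data/aclImdb/train/", categories=["neg", "pos"])`, assuming the subfolders are indeed named `neg` and `pos`.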


## 7.1 Representing text as a Bag of Words

A bag-of-words representation ignores word order and simply counts how many times each word appears in a document. Words that carry little meaning, such as stop words, are often discarded. Computing the bag of words consists of three steps:

**01 / Tokenization**: Split each document into words at delimiters such as whitespace and punctuation. Each word becomes a *token*.

**02 / Vocabulary building**: Collect all words that appear in any document and order them (say, alphabetically).

**03 / Encoding**: For each document, count how often each vocabulary word appears in it.

The bag-of-words representation is implemented in *CountVectorizer*, which is a transformer.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer


vect = CountVectorizer().fit(text_train)  # builds the vocabulary
X_train = vect.transform(text_train)  # encodes the documents as word counts

print(f"Vocabulary size: {len(vect.vocabulary_)}")

print(f"X_train: \n {repr(X_train)}") # shows a representation of the sparse array used to store the data


Vocabulary size: 124255
X_train: 
 <75000x124255 sparse matrix of type '<class 'numpy.int64'>'
	with 10315542 stored elements in Compressed Sparse Row format>
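
The summary above reports 10,315,542 stored elements in a 75000 x 124255 matrix, so only about 0.1% of the entries are non-zero, which is why a sparse format is used. The sketch below shows how that figure can be computed from the `shape` and `nnz` attributes of a SciPy sparse matrix; the three-document corpus is a made-up stand-in for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, for illustration only
docs = ["good film", "bad film", "good good plot"]

X = CountVectorizer().fit_transform(docs)

n_rows, n_cols = X.shape
# nnz is the number of explicitly stored (non-zero) entries
density = X.nnz / (n_rows * n_cols)

print(f"Shape: {X.shape}, stored elements: {X.nnz}")
print(f"Fraction of non-zero entries: {density:.2f}")
print(f"Dense view:\n{X.toarray()}")
```

On a corpus this small half of the entries are non-zero. With a realistic vocabulary most documents use only a small fraction of the words, and the density drops sharply.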


In [18]:
# another way to get more details about the vocabulary
feature_names = vect.get_feature_names_out()

print(f"Number of features {len(feature_names)}")
print(f"First 20 features \n {feature_names[:20]}")
print(f"Features 150 to 200 \n {feature_names[150:200]}")
print(f"Every 2000th feature {feature_names[::2000]}")

Number of features 124255
First 20 features 
 ['00' '000' '0000' '0000000000000000000000000000000001' '0000000000001'
 '000000001' '000000003' '00000001' '000001745' '00001' '0001' '00015'
 '0002' '0007' '00083' '000ft' '000s' '000th' '001' '002']
Features 150 to 200 
 ['10x' '10x20' '10yr' '10yrs' '11' '110' '1100' '11001001' '1100ad'
 '110mph' '110th' '111' '1119' '112' '11211' '1123745598' '112th' '113'
 '1130pm' '1138' '114' '1146' '115' '1150' '116' '1165' '116577' '117'
 '1172' '118' '1188' '118k' '119' '1192' '1193' '11f' '11in' '11m' '11pm'
 '11th' '12' '120' '1200' '1200f' '1201' '1202' '12076' '1209' '120mints'
 '120minutes']
Every 2000th feature ['00' '_require_' 'aideed' 'announcement' 'asteroid' 'banquière'
 'besieged' 'bollwood' 'btvs' 'carboni' 'chcialbym' 'clotheth'
 'consecration' 'cringeful' 'deadness' 'devagan' 'doberman' 'duvall'
 'endocrine' 'existent' 'fetiches' 'formatted' 'garard' 'godlie' 'gumshoe'
 'heathen' 'honoré' 'immatured' 'interested' 'jewelry' 'kerchne

In [21]:
# First simple model

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(solver='saga', max_iter=500), X_train, y_train, cv=5)
print(f"Mean cross-validation accuracy {np.mean(scores)}")



Mean cross-validation accuracy 0.7005333333333332
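
In the cell above the vocabulary is built once on the full training set, before cross-validation splits it. A possible variant, sketched here on a made-up toy corpus and not taken from the original notebook, wraps the vectorizer and the classifier in a pipeline so that each fold builds its vocabulary from its own training part only. The documents, labels, and `cv=2` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, for illustration only (1 = positive, 0 = negative)
docs = ["a great film", "great fun and a real treat",
        "a real treat", "great great fun",
        "a dull film", "dull and boring",
        "boring plot", "a dull boring mess"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refit inside every fold, together with the classifier
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, docs, labels, cv=2)
print(f"Accuracy per fold: {scores}")
print(f"Mean cross-validation accuracy: {np.mean(scores)}")
```

The pipeline accepts the raw documents directly, so the same object could later be passed to a parameter search over both the vectorizer and the classifier.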


