# Textual Categorical Features: Ordinal and Nominal

In [None]:
import pandas as pd

If you have a categorical feature, how you represent it in your dataset depends on whether it is an ordinal or a nominal feature. For **ordinals**, create a **mapping** from each possible value of the feature to increasing integers that follow the natural order of the values. Any entry not found in your designated category list is mapped to -1:

In [39]:
# creating a dummy dataset
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

# sample dataframe with some results. 'Mad' is not part of ordered_satisfaction:
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})

# convert the column to an ordered categorical whose order follows ordered_satisfaction
# (recent pandas versions take the categories and the ordering through a CategoricalDtype)
satisfaction_dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
df.satisfaction = df.satisfaction.astype(satisfaction_dtype)

print(df) # the NaN is because 'Mad' is not a category
print(df.dtypes) # is a category
print()

# changing the dtype to numeric with cat.codes
df.satisfaction = df.satisfaction.cat.codes
print(df)
print(df.dtypes)

  satisfaction
0          NaN
1        Happy
2      Unhappy
3      Neutral
satisfaction    category
dtype: object

   satisfaction
0            -1
1             3
2             1
3             2
satisfaction    int8
dtype: object
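
The same ordinal encoding can also be written as an explicit dictionary lookup, which makes the "mapping of increasing integers" visible. This is only a minimal sketch using the same category list; `satisfaction_codes` and `encoded` are names introduced here for illustration:

```python
import pandas as pd

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

# build {'Very Unhappy': 0, 'Unhappy': 1, ...} from the ordered list
satisfaction_codes = {label: code for code, label in enumerate(ordered_satisfaction)}

df = pd.DataFrame({'satisfaction': ['Mad', 'Happy', 'Unhappy', 'Neutral']})

# map() yields NaN for labels missing from the dictionary; replace those with -1
encoded = df['satisfaction'].map(satisfaction_codes).fillna(-1).astype(int)
print(encoded.tolist())
```

The categorical approach above is usually more convenient, but the dictionary form is handy when the codes need to be something other than 0, 1, 2, and so on.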


  


On the other hand, if your feature is **nominal** and therefore **lacks any inherent numeric ordering**, you have two options. The first is to encode the feature the same way as above. This is a quick-and-dirty approach. While you're still getting accustomed to your dataset and taking it for a first run through your data analysis pipeline, this method might be the most convenient:

In [47]:
df = pd.DataFrame({'vertebrates':['Bird','Bird','Mammal','Fish','Amphibian','Reptile','Mammal']})

# this time ordered=True is not passed and no category list is given, so pandas numbers the distinct values in sorted (alphabetical) order
df['vertebrates_numeric'] = df.vertebrates.astype(dtype="category").cat.codes

In [48]:
df

  vertebrates  vertebrates_numeric
0        Bird                    1
1        Bird                    1
2      Mammal                    3
3        Fish                    2
4   Amphibian                    0
5     Reptile                    4
6      Mammal                    3


Notice that this time **ordered=True was not passed in**, nor was a specific ordering listed. Because of this, Pandas automatically encodes your nominal entries in alphabetical order. This approach is fine for getting your feet wet, but it introduces an ordering into a categorical list of items that inherently has none. A model that treats the codes as numbers will, for example, see Reptile (4) as "greater than" Bird (1), which may or may not cause problems later. If you aren't getting the results you hoped for, or you are but would like to improve accuracy further, a more precise encoding is to separate the distinct values into individual boolean features:

In [49]:
df = pd.DataFrame({'vertebrates':['Bird','Bird','Mammal','Fish','Amphibian','Reptile','Mammal']})

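# get_dummies() converts a categorical variable into dummy/indicator variables:
# one column per distinct value, holding 1 where the row has that value and 0 otherwise.
# dtype=int keeps the 0/1 output shown below (recent pandas versions default to booleans).
df = pd.get_dummies(df, columns=['vertebrates'], dtype=int)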
df

   vertebrates_Amphibian  vertebrates_Bird  vertebrates_Fish  vertebrates_Mammal  vertebrates_Reptile
0                      0                 1                 0                   0                    0
1                      0                 1                 0                   0                    0
2                      0                 0                 0                   1                    0
3                      0                 0                 1                   0                    0
4                      1                 0                 0                   0                    0
5                      0                 0                 0                   0                    1
6                      0                 0                 0                   1                    0


These newly created features are called boolean features because the only values they can contain are 0 for non-inclusion and 1 for inclusion. The Pandas get_dummies() function lets you completely replace a single nominal feature with multiple boolean indicator features. It has several configurable options, including the ability to return sparse-backed columns (the SparseDataFrame class of older Pandas versions was removed in Pandas 1.0) and control over the column-name prefix. Its benefit over the first method above is that no erroneous ordering is introduced into your dataset.
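
The prefix and sparse options mentioned above can be tried out roughly as follows. This is a sketch rather than a recommended configuration: the `is` prefix and the variable names `dummies` and `sparse_dummies` are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish',
                                   'Amphibian', 'Reptile', 'Mammal']})

# prefix / prefix_sep control the generated column names;
# dtype=int asks for 0/1 integers instead of booleans
dummies = pd.get_dummies(df, columns=['vertebrates'],
                         prefix='is', prefix_sep='_', dtype=int)
print(dummies.columns.tolist())

# sparse=True backs the indicator columns with sparse arrays to save memory
sparse_dummies = pd.get_dummies(df, columns=['vertebrates'], sparse=True)
print(sparse_dummies.dtypes)
```

Since every row belongs to exactly one category, each row of `dummies` contains a single 1.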

# Pure Textual Features

If you are trying to "featurize" a body of text such as a webpage, a tweet, a passage from a newspaper, an entire book, or a PDF document, building a vocabulary of the words it contains and counting their frequency is an extremely powerful encoding tool. This is known as the **Bag of Words model**, implemented by the **CountVectorizer class in scikit-learn**. Even though the grammar of your sentences and their word order are completely discarded, this model has accomplished some pretty amazing things, such as correctly identifying J.K. Rowling's writing from a blind lineup of authors:

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

#### CountVectorizer:
Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

In [60]:
corpus = ["Authman ran faster than Harry because he is an athlete.","Authman and Harry ran faster and faster.", "W00t."]

In [61]:
bow = CountVectorizer()
X = bow.fit_transform(corpus) # this produces a SciPy sparse matrix, which stores only the non-zero values.
X

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [62]:
bow.get_feature_names_out().tolist() # get_feature_names() was removed in scikit-learn 1.2

['an',
 'and',
 'athlete',
 'authman',
 'because',
 'faster',
 'harry',
 'he',
 'is',
 'ran',
 'than',
 'w00t']

In [65]:
# We can view the sparse matrix as a dense array with:
X.toarray()
# the word 'an' is used once in the first sentence, the word 'and' is used twice in the second sentence ...

array([[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [0, 2, 0, 1, 0, 2, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

Notes:


Notice that X is not stored as the regular [n_samples, n_features] dataframe you've become accustomed to. Rather, it is a SciPy compressed sparse row (CSR) matrix. SciPy is a library of mathematical algorithms and convenience functions that extends NumPy. X is a sparse matrix instead of a classical dataframe because even this small example of three short documents created 12 features. The average English speaker knows around 8000 unique words. If each sentence were an 8000-sized vector sample in your dataframe, consisting mostly of 0's, it would be a poor use of memory.

To avoid this, SciPy's sparse matrices store only the non-zero entries and their positions; everything else is assumed to be zero. You can always convert X to a regular NumPy array with the .toarray() method, but the result is dense, which might not be desirable for memory reasons. To use your compressed sparse row matrix in Pandas, convert it to a dataframe with sparse columns, for example with pd.DataFrame.sparse.from_spmatrix() (the older SparseDataFrame class was removed in Pandas 1.0). More notes on that in the Dive Deeper section.

The bag of words model has other configurable parameters you can tune, such as having it pay attention to the order of words in your text. In such implementations, pairs or longer tuples of successive words (n-grams, set through the ngram_range parameter of CountVectorizer) are used to build the vocabulary instead of individual words:

```python
# bow fitted with CountVectorizer(ngram_range=(2, 2)); features come back in sorted order
bow.get_feature_names_out().tolist()
['an athlete', 'and faster', 'and harry', 'authman and', 'authman ran', 'because he', 'faster and', 'faster than', 'harry because', 'harry ran', 'he is', 'is an', 'ran faster', 'than harry']
```

Another option is to use frequencies instead of raw counts. CountVectorizer itself only counts, so in scikit-learn this is done with TfidfTransformer or TfidfVectorizer. This is useful when you have documents of different lengths. Words show up more often in a longer document than in a shorter one simply because of its length, so normalizing the word counts by the total number of words in each document creates a fairer, more direct comparison between the two.
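
One way to get length-normalized frequencies in scikit-learn is `TfidfVectorizer` with the inverse-document-frequency weighting switched off. This is a sketch of that idea, not the only way to do it; `freq_bow` and `X_freq` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Authman ran faster than Harry because he is an athlete.",
          "Authman and Harry ran faster and faster.",
          "W00t."]

# use_idf=False turns off inverse-document-frequency weighting, and norm='l1'
# divides each row by its total word count, leaving plain term frequencies
freq_bow = TfidfVectorizer(use_idf=False, norm='l1')
X_freq = freq_bow.fit_transform(corpus)

print(X_freq.toarray().round(2))
```

Each row now sums to 1, so 'faster' counts for 2 of the 7 words in the second sentence regardless of how long the other documents are.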

# Image Feature Encoding

In [70]:
import imageio.v3 as iio # scipy.misc.imread was removed in SciPy 1.2; imageio is its suggested replacement

In [74]:
img = iio.imread('cat.jpg') # ndarray
type(img)

numpy.ndarray

In [75]:
img.shape  # 3 color channels

(3000, 4000, 3)

In [76]:
img.dtype # integers with 8bits - they range from 0 to 255 (for each color channel)

dtype('uint8')

In [78]:
img = img[::2,::2] # keep every second pixel in each direction; neighbouring pixels are usually almost the same color
img.shape # the image shrinks to half its height and width

(1500, 2000, 3)

In [79]:
img = (img/255.).reshape(-1,3) # red, green, blue channels are scaled to 0-1, then reshaped to one row per pixel with 3 columns (R, G, B)

In [82]:
red = img[:,0]
green = img[:,1]
blue = img[:,2]

grey = 0.299*red + 0.587*green + 0.114*blue  # convert the image to grayscale with the luminance formula

# we can do machine learning with gray!
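
The whole image pipeline above can be run end to end without an image file. The sketch below uses a random array as a stand-in for the decoded photo (an assumption made here so the example is self-contained); `pixels` is an illustrative name.

```python
import numpy as np

# synthetic stand-in for a decoded photo: height x width x RGB, 8 bits per channel
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)

img = img[::2, ::2]                    # keep every second row and column
pixels = (img / 255.0).reshape(-1, 3)  # scale to [0, 1]; one row per pixel, one column per channel

# weighted sum of the channels (the luminance weights used above)
grey = 0.299 * pixels[:, 0] + 0.587 * pixels[:, 1] + 0.114 * pixels[:, 2]

print(pixels.shape, grey.shape)
```

Because the three weights sum to 1, the grayscale values stay in the same 0-1 range as the normalized channels.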

# Audio Feature Encoding

In [83]:
import scipy.io.wavfile as wavfile

In [85]:
sample_rate, audio_data = wavfile.read('sexta.wav')

In [87]:
sample_rate  # 48000 Hz (48 kHz)

48000

In [88]:
audio_data  # we can do machine learning with the audio

array([[ 0,  0],
       [ 0,  0],
       [ 0,  0],
       ...,
       [57, 55],
       [51, 49],
       [45, 45]], dtype=int16)
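
As with images, the raw samples are usually scaled before being fed to a model. The sketch below is self-contained: it writes a synthetic one-second stereo tone to a temporary file as a stand-in for a real recording (an assumption made here), reads it back, and converts it to a normalized mono signal. The names `mono` and `duration_seconds` are illustrative.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile as wavfile

# synthesize one second of a 440 Hz stereo tone as a stand-in recording
sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
stereo = np.column_stack([tone, tone])

path = os.path.join(tempfile.mkdtemp(), 'tone.wav')
wavfile.write(path, sample_rate, stereo)

rate, audio_data = wavfile.read(path)

# int16 samples range from -32768 to 32767; scale to [-1, 1) and average the two channels
mono = (audio_data / 32768.0).mean(axis=1)
duration_seconds = audio_data.shape[0] / rate

print(rate, audio_data.shape, audio_data.dtype, duration_seconds)
```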