#  Conditional random fields

Suppose we do not know the actual labels of a set of opinions, or of any other dataset. We can use unsupervised learning to assign labels to the unlabeled data. This is very useful: when we scrape free text from the web, we can label it and then apply supervised learning techniques to predict or classify new instances. I will use the same opinions as in my [thesis](http://132.248.9.195/ptd2016/mayo/307602673/Index.html), where the task was sentiment analysis on a corpus of washing machine reviews in Spanish (i.e. sentiment classification from 1 to 5 stars).


## First read the data:

In [1]:
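import pandas as pd

# Pipe-separated file; the column names are supplied explicitly via names=.
# dropna() discards rows with missing values.
df = pd.read_csv('/Users/user/jupyter_notebooks/unsupervised_feature_learning/data/new_corpus.csv', sep='|', names=['id', 'content']).dropna()

# Raw review text; no labels are available at this point.
X = df['content']

#y = df['label'].values
df.head()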

| | id | content |
|---|---|---|
| 0 | AEG_Electrolux_60840_Lavamat__Opinion_1506705.... | 'Silencio y facilidad. Con estas dos palabras... |
| 1 | AEG_Electrolux_62610_Lavamat__Opinion_2000923.... | 'Hola compis!No sabía como se ponía una lavad... |
| 2 | AEG_Electrolux_L14800VI__Opinion_2005396.html.txt | 'Esta lavadora es de lo más práctica para aqu... |
| 3 | AEG_Electrolux_L6227FL__Opinion_2140710.html.txt | 'Buenas tardes amigos y compañeros de ciao...... |
| 4 | AEG_Electrolux_L62280FL__Opinion_2151025.html.txt | 'Empecemos por una ventaja muy importante: la... |


## Vectorize our data

In [2]:

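from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts: one row per review, one column per vocabulary term.
count_vect = CountVectorizer()
X = count_vect.fit_transform(X)
# fit_transform returns a sparse matrix; toarray() gives a dense view for inspection.
X.toarray()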

array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
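
To make the count matrix above more concrete, here is a minimal sketch on two made-up sentences. The sentences and the names `toy_docs`, `toy_vect` and `toy_X` are illustrative assumptions, not part of the review corpus. Each column corresponds to one vocabulary term and each cell counts how often that term occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the review corpus.
toy_docs = [
    "la lavadora es muy silenciosa",
    "la lavadora es ruidosa muy ruidosa",
]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(terms)
print(toy_X.toarray())
```

Reading `terms` next to the dense array shows which word each column counts, which is hard to see in the truncated output of the full corpus.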

## With K-means:

In [3]:
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=5)
k_means.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
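
The repr above shows `random_state=None`, so the centroid initialisation, and with it the cluster ids, can change from run to run. A minimal sketch of fixing the seed, assuming reproducible labels are wanted; the toy points and the seed value `0` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# random_state fixes the centroid initialisation; n_init is set explicitly
# because its default differs between scikit-learn versions.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(points)

# Refitting with the same seed gives the same assignment.
repeat_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(toy_labels, repeat_labels)
```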

## We learned the labels:

In [4]:
k_means.labels_[::10]

array([1, 0, 1, 1, 1, 1, 1, 1, 2, 3, 2, 1, 0, 1, 1, 1, 1, 1, 0, 2, 0, 0, 1,
       1, 1, 1, 1, 1, 4, 1, 1, 0, 4, 1, 4, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2,
       1, 0, 4, 1, 0, 4, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 4, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 4, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 4, 0, 1, 1, 4, 1,
       1, 1, 1, 4, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 4, 1, 1, 1, 1, 1, 4, 1, 0, 0, 4, 1, 1, 1, 4, 1, 1, 1, 0, 2, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 4, 2, 0, 1, 0, 0, 0, 4, 1, 0,
       1, 1, 4, 1, 1, 1, 1, 4, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 3, 1,
       1, 1, 1, 1, 2, 1, 1, 2, 0, 1, 1, 1, 0, 2, 1, 1, 1, 2, 1, 1, 0, 0, 1,
       1, 1, 1, 3, 3, 1, 1, 1, 1], dtype=int32)
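
The integers 0 to 4 are cluster identifiers chosen by k-means, so they carry no built-in meaning such as a star rating. One possible way to see what a cluster captured is to list the terms with the largest weight in its centroid. The sketch below does this on made-up reviews; `toy_docs` and the helper `top_terms` are hypothetical names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not taken from the corpus.
toy_docs = [
    "silenciosa silenciosa lavadora",
    "lavadora silenciosa silenciosa silenciosa",
    "ruidosa ruidosa lavadora",
    "lavadora ruidosa ruidosa ruidosa",
]

vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts)

# Column index -> term, recovered from the fitted vocabulary.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

def top_terms(cluster_id, n=1):
    """Return the n terms with the largest weight in a cluster centroid."""
    order = np.argsort(km.cluster_centers_[cluster_id])[::-1]
    return terms[order[:n]].tolist()

for cluster_id in range(km.n_clusters):
    print(cluster_id, top_terms(cluster_id))
```

On the real corpus the same idea would use `count_vect` and `k_means` from the cells above, with a larger `n`.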

In [5]:
y = k_means.labels_
type(y)

numpy.ndarray

## Now let's add the labels to the dataframe

In [6]:
df['labels'] = y

In [13]:
# The column created above is 'labels' (plural), so the same name must be used here.
df.to_csv('/Users/user/jupyter_notebooks/unsupervised_feature_learning/data/new_labeled_corpus.csv', columns=['id', 'content', 'labels'], index=False)
df.head()

| | id | content | labels |
|---|---|---|---|
| 0 | AEG_Electrolux_60840_Lavamat__Opinion_1506705.... | 'Silencio y facilidad. Con estas dos palabras... | 1 |
| 1 | AEG_Electrolux_62610_Lavamat__Opinion_2000923.... | 'Hola compis!No sabía como se ponía una lavad... | 1 |
| 2 | AEG_Electrolux_L14800VI__Opinion_2005396.html.txt | 'Esta lavadora es de lo más práctica para aqu... | 1 |
| 3 | AEG_Electrolux_L6227FL__Opinion_2140710.html.txt | 'Buenas tardes amigos y compañeros de ciao...... | 1 |
| 4 | AEG_Electrolux_L62280FL__Opinion_2151025.html.txt | 'Empecemos por una ventaja muy importante: la... | 1 |


In [8]:
df['labels'].value_counts()

1    1758
0     536
2     183
4     105
3      36
Name: labels, dtype: int64
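
The clusters are very uneven in size. A small sketch that turns the counts reported above into proportions; the numbers are copied from the `value_counts()` output, and `cluster_sizes` is a name introduced here for illustration.

```python
import pandas as pd

# Cluster sizes copied from the value_counts() output above.
cluster_sizes = pd.Series({1: 1758, 0: 536, 2: 183, 4: 105, 3: 36}, name="labels")

total = cluster_sizes.sum()
shares = cluster_sizes / total  # fraction of reviews in each cluster

print(total)
print(shares.round(3))
```

Cluster 1 alone holds roughly two thirds of the reviews, which is worth keeping in mind before treating these ids as balanced classes.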

In [6]:
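# Requires the pystruct package; without it the import fails, as in the output below.
from pystruct.models import ChainCRF
from pystruct.learners import FrankWolfeSSVM

model = ChainCRF()
ssvm = FrankWolfeSSVM(model=model, C=.1, max_iter=10)
# X_train, y_train, X_test and y_test are not defined earlier in this notebook;
# they must be created (for example with a train/test split) before this cell can run.
ssvm.fit(X_train, y_train)

ssvm.score(X_test, y_test)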


ImportError: No module named 'pystruct'
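
Since the pystruct import fails here, the following is only a sketch of the supervised step described in the introduction, with scikit-learn's `LogisticRegression` swapped in for the `ChainCRF`. This is a plain classifier that treats each review independently, not a conditional random field. The toy reviews and the names `train_docs`, `new_docs`, `pseudo_labels` and `clf` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews standing in for df['content'].
train_docs = [
    "muy silenciosa silenciosa",
    "silenciosa silenciosa y buena",
    "muy ruidosa ruidosa",
    "ruidosa ruidosa y mala",
]
new_docs = ["silenciosa silenciosa", "ruidosa ruidosa"]

vect = CountVectorizer()
X_toy = vect.fit_transform(train_docs)

# Step 1: derive pseudo-labels by clustering, as in the notebook.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# Step 2: fit a supervised classifier on those pseudo-labels.
clf = LogisticRegression().fit(X_toy, pseudo_labels)

# Step 3: label unseen text, reusing the vocabulary fitted on the training docs.
predicted = clf.predict(vect.transform(new_docs))
print(pseudo_labels, predicted)
```

Note that new text goes through `vect.transform`, not `fit_transform`, so its columns line up with the features the classifier was trained on.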