# landscraper - doing the dirty work for intellectual property (IP) decisions

__Contributor: Akhil Jindal__ | https://github.com/akhil-jindal/

## Library Imports:

In [1]:
from sklearn.datasets import load_files  # only load_files is used below
from sklearn import model_selection
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

import glob
import numpy as np

## Building a pipeline

In [2]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                           alpha=1e-3, random_state=42,
                           max_iter=5, tol=None)),
])
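
To make the first two stages of the pipeline concrete, here is a minimal sketch on a toy corpus. The three documents are made up for illustration and are not taken from the patent data; the snippet assumes Python 3 and a recent scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, not part of the patent corpus.
toy_docs = [
    "battery cell with lithium anode",
    "lithium battery charging circuit",
    "wireless antenna for mobile device",
]

# 'vect' stage: each document becomes a sparse row of token counts.
vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)

# 'tfidf' stage: counts are reweighted so that terms shared by many
# documents count for less, then each row is scaled to unit length.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

# Compare a term unique to the first document with one shared by two documents.
first_row = weights.toarray()[0]
anode_weight = first_row[vect.vocabulary_["anode"]]
battery_weight = first_row[vect.vocabulary_["battery"]]
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(counts.shape)
print(anode_weight > battery_weight)
```

The `clf` stage then receives these TF-IDF rows as its feature matrix.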

## Identifying and processing training inputs:

In [3]:
corpus = "../data/corpus/"
patents = load_files(corpus)
classifications = patents.target_names
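
`load_files` expects one subfolder per class, with each document stored as a separate file inside it. The sketch below builds such a layout in a temporary directory to show how folder names become `target_names`. The folder names, file names, and texts are hypothetical; the real layout of `../data/corpus/` is not shown in this notebook.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical class folders and documents, only to illustrate the layout.
layout = {
    "class_a": ["first document about batteries"],
    "class_b": ["second document about antennas"],
}

root = tempfile.mkdtemp()
for label, texts in layout.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts):
        path = os.path.join(root, label, "doc_%d.txt" % i)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)

# encoding="utf-8" returns str documents; without it, data holds raw bytes.
toy = load_files(root, encoding="utf-8")

print(toy.target_names)  # subfolder names, in sorted order
print(len(toy.data))     # one entry per file
```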

## Split training data:

In [4]:
# Hold out 30% of the documents for testing. The split is unseeded, so it differs between runs.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    patents.data, patents.target, train_size=0.7)
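
A possible variation on this split, sketched on made-up data: passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both parts. Neither argument is used in the cell above, and the seed value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical documents and labels standing in for patents.data / patents.target.
docs = ["doc %d" % i for i in range(10)]
labels = [0] * 5 + [1] * 5

# Same seed, same inputs: the two calls return identical splits.
split_a = train_test_split(docs, labels, train_size=0.7,
                           random_state=42, stratify=labels)
split_b = train_test_split(docs, labels, train_size=0.7,
                           random_state=42, stratify=labels)

docs_train, docs_test, labels_train, labels_test = split_a
print(split_a == split_b)
print(len(docs_train), len(docs_test))
```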



## Train the model and test it:

In [5]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...ty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False))])

In [6]:
prediction = pipeline.predict(X_test)

In [8]:
np.mean(prediction == y_test)

0.6440129449838188
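
The expression `np.mean(prediction == y_test)` is the fraction of test documents whose predicted label matches the true one, which is the same quantity `accuracy_score` computes. A small sketch with made-up labels shows the equivalence and adds a confusion matrix, which reveals which classes are being mixed up. The arrays below are hypothetical and unrelated to the patent results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Fraction of exact matches, computed two ways.
manual = np.mean(y_pred == y_true)
library = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes;
# the diagonal holds the correctly classified counts.
cm = confusion_matrix(y_true, y_pred)

print(manual, library)
print(cm)
```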

## Results:

The current SGDClassifier pipeline predicts the correct classification for about 64% of the held-out test documents. Because the train/test split is not seeded, this figure will vary somewhat between runs. I will tune the parameters in [./notebooks/gridsearch.ipynb](http://localhost:8888/notebooks/Dropbox/src/landscraper/notebooks/gridsearch.ipynb) to see if we can get better accuracy.
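
As a rough idea of what such tuning could look like, here is a minimal sketch using `GridSearchCV` over the same pipeline. The parameter grid and the toy documents are assumptions made for illustration; they are not the contents of the gridsearch notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical two-class toy corpus standing in for the patent data.
battery_docs = [
    "lithium battery cell", "battery anode material",
    "rechargeable lithium cell", "battery charging circuit",
    "cell electrode coating", "lithium anode electrode",
]
antenna_docs = [
    "wireless antenna array", "radio signal antenna",
    "mobile wireless receiver", "antenna signal amplifier",
    "radio receiver circuit", "wireless radio transmitter",
]
docs = battery_docs + antenna_docs
labels = [0] * len(battery_docs) + [1] * len(antenna_docs)

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

# Assumed grid: step name, double underscore, parameter name.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-2, 1e-3],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real corpus, `docs` and `labels` would be replaced by `X_train` and `y_train`, keeping the test set aside for a final check.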