# Intro to Data Science @ SzISz Part VIII.
## Neural Network recap, Git, Kaggle challenge

### Table of contents
- <a href="http://playground.tensorflow.org/">Playing with Neural Networks</a>
- <a href="https://datanice.wordpress.com/2016/04/10/step-by-step-kaggle-competition-tutorial/">Kaggle tutorial</a>
- <a href="#What-is-Git?">Git Flow</a>
- <a href="#Experiments">Experiments</a>
    

## What is Git?
Git is a distributed version control system. It lets users keep track of the changes they make to their projects, most commonly changes to source code files.

## Why is it important?
With a version control system, changes made to a project can easily be reverted. When multiple people work on the same files, it also helps to manage their changes and merge them together.

## Tools
- Git
- GitHub

## What is GitHub?
GitHub is a public code hosting service built on the Git version control system. It provides several extra features for coordinating project work.

## The Git Flow
See: https://guides.github.com/introduction/flow

In [1]:
%matplotlib inline
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)  # call the function; assigning to it would overwrite it without seeding
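
As a quick sanity check on seeding, the sketch below calls `np.random.seed(42)` twice and confirms that the same numbers come out both times. Seeding is a function call; an assignment such as `np.random.seed = 42` would only overwrite the function. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Seed the global NumPy generator, then draw three numbers.
np.random.seed(42)
first = np.random.rand(3)

# Re-seeding with the same value restarts the same sequence.
np.random.seed(42)
second = np.random.rand(3)

same_draws = np.array_equal(first, second)
print(same_draws)
```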

## Experiments

Let's acquire the <a href="https://www.kaggle.com/c/job-salary-prediction">Kaggle job salary dataset</a>, and run some experiments on it!

In [29]:
import string
import operator
from collections import Counter

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Normalizer

In [10]:
base = '../../data/kaggle_job/'
df = pd.read_csv(base+'Train_rev1.csv')
df.head(1)

|  | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12612628 | Engineering Systems Analyst | Engineering Systems Analyst Dorking Surrey Sal... | Dorking, Surrey, Surrey | Dorking |  | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |


In [20]:
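class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column of a DataFrame for use inside a Pipeline."""

    def __init__(self, column, extend=False):
        self.column = column  # name of the column to select
        self.extend = extend  # if True, return a 2-D array of shape (n_samples, 1)

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        if self.extend:
            # Numeric estimators expect 2-D input, so add a trailing axis.
            return X[self.column].values[:, np.newaxis]
        # Text vectorizers expect a 1-D iterable of documents.
        return X[self.column]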
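
To make the effect of `extend` concrete, here is a small sketch that applies the selector to a made-up two-row frame. The toy data is hypothetical, and the class is repeated only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    # Same selector as above, repeated so the snippet is self-contained.
    def __init__(self, column, extend=False):
        self.column = column
        self.extend = extend

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.extend:
            return X[self.column].values[:, np.newaxis]
        return X[self.column]


# Hypothetical two-row frame, not taken from the Kaggle data.
toy = pd.DataFrame({
    'FullDescription': ['python developer', 'sales manager'],
    'SalaryNormalized': [30000, 25000],
})

text_col = ItemSelector('FullDescription').fit_transform(toy)
salary_col = ItemSelector('SalaryNormalized', extend=True).fit_transform(toy)

print(text_col.shape)    # 1-D Series, suitable for a text vectorizer
print(salary_col.shape)  # 2-D array, suitable for numeric transformers
```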

In [23]:
feat = FeatureUnion(transformer_list=[
    # Text branch: TF-IDF over the job descriptions, reduced to 90 dimensions.
    ('desc', Pipeline([
        ('sel', ItemSelector('FullDescription')),
        ('tfidf', TfidfVectorizer(max_df=.9, min_df=4)),
        ('svd', TruncatedSVD(n_components=90))
    ])),
    # Numeric branch: the salary as a single (n_samples, 1) column.
    ('salary', Pipeline([
        ('sel', ItemSelector('SalaryNormalized', True)),
        ('norm', Normalizer())
    ]))
])
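
One thing worth checking in the `salary` branch: `Normalizer` rescales each row to unit norm, so a single-column input of positive values collapses to all ones. The sketch below shows this on made-up salaries and contrasts it with `StandardScaler`, which scales per column. `StandardScaler` is only one possible alternative here, not necessarily what the notebook intended.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical salaries as a single-column feature matrix.
salaries = np.array([[20000.0], [25000.0], [40000.0]])

# Normalizer works per row: each one-element row is divided by its own norm.
row_normalized = Normalizer().fit_transform(salaries)

# StandardScaler works per column: subtract the mean, divide by the std.
col_scaled = StandardScaler().fit_transform(salaries)

print(row_normalized.ravel())
print(col_scaled.ravel())
```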

In [24]:
X = feat.fit_transform(df)
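
Since `FeatureUnion` stacks its branches side by side, `X` should end up with 91 columns here (90 SVD components plus the salary column), assuming the vocabulary is large enough for 90 components. The text branch can be tried in isolation on a tiny corpus; the documents and the choice of 2 components below are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus standing in for the FullDescription column.
docs = [
    'senior python developer for a data team',
    'junior java developer in london',
    'sales manager for a retail company',
    'regional sales executive in london',
    'data analyst with python and sql skills',
]

text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),                           # sparse TF-IDF matrix
    ('svd', TruncatedSVD(n_components=2, random_state=0)),  # dense, 2 columns
])

reduced = text_pipe.fit_transform(docs)
print(reduced.shape)
```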



In [51]:
def replace(word):
    # Strip every printable character that is not an ASCII letter or digit, then lowercase.
    removable = set(string.printable) - set(string.ascii_letters + string.digits)
    for char in removable:
        word = word.replace(char, '')
    return word.lower()

In [52]:
def clean(title):
    return [replace(word) for word in title.split()]
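
A short usage sketch of the two helpers on made-up titles. The functions are restated (with the character set hoisted into a hypothetical `REMOVABLE` constant) so that the snippet runs on its own.

```python
import string

# Characters to strip: everything printable except ASCII letters and digits.
REMOVABLE = set(string.printable) - set(string.ascii_letters + string.digits)


def replace(word):
    for char in REMOVABLE:
        word = word.replace(char, '')
    return word.lower()


def clean(title):
    return [replace(word) for word in title.split()]


tokens = clean('Senior C++ Developer (London)')
print(tokens)

# A word made only of punctuation turns into an empty string.
with_dash = clean('Sales - Manager')
print(with_dash)
```

Note that the empty strings stay in the token list, so they will also be counted later unless they are filtered out.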

In [78]:
y = df['Title'].fillna('').map(clean)

In [57]:
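# `reduce` is not a builtin in Python 3; counting tokens directly also avoids
# concatenating every title into one huge list.
top = Counter(word for title in y.values for word in title)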

In [79]:
top1000 = {word for word, _ in top.most_common(1000)}  # a set gives fast membership tests

In [80]:
y_transformed = y.map(lambda title: ' '.join([word for word in title if word in top1000]))
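
The counting and filtering steps above can be sketched on a handful of made-up titles. The data and the cutoff of 3 are hypothetical; the notebook itself keeps the 1000 most frequent tokens.

```python
from collections import Counter

# Hypothetical tokenised titles, in the shape produced by clean().
titles = [
    ['senior', 'developer'],
    ['junior', 'developer'],
    ['sales', 'manager'],
    ['senior', 'manager'],
]

# Count every token across all titles.
counts = Counter(word for title in titles for word in title)

# Keep the 3 most frequent tokens.
top_k = {word for word, _ in counts.most_common(3)}

# Drop the rarer tokens and join what is left back into a string.
filtered = [' '.join(w for w in title if w in top_k) for title in titles]
print(filtered)
```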

In [None]:
# from here on: train, transform, go! :D

## Tasks:
- Kristóf creates the team
- Everyone registers
- Everyone joins the team
- Kristóf submits a baseline solution
- Download the data
- Read the forum
- Clean the data
- Select the relevant data
- Build a basic model