# Lambda School Data Science - A First Look at Data



## Lecture - let's explore Python DS libraries and examples!

The Python Data Science ecosystem is huge. You've seen some of the big pieces - pandas, scikit-learn, matplotlib. What parts do you want to see more of?

In [1]:
# TODO - we'll be doing this live, taking requests
# and reproducing what it is to look up and learn things




## Assignment - now it's your turn

Pick at least one Python DS library and, using documentation and examples, reproduce something cool in this notebook. It's OK if you don't fully understand it or get it 100% working, but do put in effort and look things up.

In [None]:
# Guide to using eli5 and XGBoost

In [1]:
import csv
import numpy as np

with open(r'C:\Users\Thier\Downloads\eli5-master\eli5-master\notebooks\titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]

[OrderedDict([('PassengerId', '1'),
              ('Survived', '0'),
              ('Pclass', '3'),
              ('Name', 'Braund, Mr. Owen Harris'),
              ('Sex', 'male'),
              ('Age', '22'),
              ('SibSp', '1'),
              ('Parch', '0'),
              ('Ticket', 'A/5 21171'),
              ('Fare', '7.25'),
              ('Cabin', ''),
              ('Embarked', 'S')])]

In [2]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))

891 items total, 38.4% true


In [3]:
for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])

In [4]:
import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=DeprecationWarning)
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted

evaluate(pipeline)

Accuracy: 0.823 ± 0.071


In [5]:
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names

0:[Sex=female<-9.53674316e-07] yes=1,no=2,missing=1
	1:[Age<13] yes=3,no=4,missing=4
		3:[SibSp<2] yes=7,no=8,missing=7
			7:leaf=0.145454556
			8:leaf=-0.125
		4:[Fare<26.2687492] yes=9,no=10,missing=9
			9:leaf=-0.151515156
			10:leaf=-0.0727272779
	2:[Pclass=3<-9.53674316e-07] yes=5,no=6,missing=5
		5:[Fare<12.1750002] yes=11,no=12,missing=12
			11:leaf=0.0500000007
			12:leaf=0.175193802
		6:[Fare<24.8083496] yes=13,no=14,missing=14
			13:leaf=0.0365591422
			14:leaf=-0.151999995



In [6]:
from eli5 import show_weights
show_weights(booster, vec=vec)

Weight,Feature
0.4278,Sex=female
0.1949,Pclass=3
0.0665,Embarked=S
0.0510,Pclass=2
0.0420,SibSp
0.0417,Cabin=
0.0385,Embarked=C
0.0358,Ticket=1601
0.0331,Age
0.0323,Fare


In [7]:
from eli5 import show_prediction
show_prediction(booster, valid_xs[1], vec=vec, show_feature_values=True)

Contribution?,Feature,Value
1.673,Sex=female,1.000
0.479,Embarked=S,Missing
0.07,Fare,7.879
-0.004,Cabin=,1.000
-0.006,Parch,0.000
-0.009,Pclass=2,Missing
-0.009,Ticket=1601,Missing
-0.012,Embarked=C,Missing
-0.071,SibSp,0.000
-0.073,Pclass=1,Missing


In [8]:
def no_missing(feature_name, feature_value):
    # keep only features whose value is present (not NaN) for this row
    return not np.isnan(feature_value)
show_prediction(booster, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)

Contribution?,Feature,Value
1.673,Sex=female,1.0
0.07,Fare,7.879
-0.004,Cabin=,1.0
-0.006,Parch,0.0
-0.071,SibSp,0.0
-0.147,Age,19.0
-0.528,<BIAS>,1.0
-1.1,Pclass=3,1.0


In [9]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, CSCTransformer(), clf2)
evaluate(pipeline2)

Accuracy: 0.839 ± 0.081


### Assignment questions

After you've worked on some code, answer the following questions in this text block:

### 1.  Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.

When you want to answer a question with data, there are many models you can use to make predictions, and each has benefits and drawbacks. For example, you could pick a model like a single decision tree, which has a high chance of being really right or really wrong. Such a model can end up memorizing its training data and simply repeating back what it was given. That makes it close to useless, because it performs well only on the data it was trained on, which defeats the purpose of building a model. If you can just look at your data and draw the same conclusions as your model, what value does the model add?

The ensemble method I used with this dataset aims to solve this problem. Two types of ensemble methods are relevant: bagging and boosting. Bagging trains many flexible models independently on random resamples of the data and averages them, which makes the combined prediction more stable. Boosting, the technique used with this dataset, combines many simple models in sequence, with each model adapting to the mistakes of its predecessor. These simple models, called weak learners, are not very powerful on their own, but when combined they can provide powerful predictions.
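
To make the "each model adapts to the mistakes of its predecessor" idea concrete, here is a minimal sketch of boosting in plain Python. It is not how XGBoost is implemented: the helper names (`fit_stump`, `boost`, `mean_squared_error`) and the toy data are invented for illustration, and it only assumes squared-error loss with one-split "stumps" as the weak learners.

```python
# Illustrative boosting sketch; all names and data here are hypothetical.

def fit_stump(xs, residuals):
    """Pick the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        if not left or not right:
            continue  # a split must put points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # residuals are the mistakes of the ensemble built so far
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data with three rough "levels" that a single stump cannot capture.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=20)
print("training MSE after 1 round:  ", mean_squared_error(one_round, xs, ys))
print("training MSE after 20 rounds:", mean_squared_error(many_rounds, xs, ys))
```

The training error shrinks as rounds are added because every new stump is fitted to what the previous ones got wrong. Note that this measures error on the training data only, so it says nothing about the memorization problem described earlier; that is what the cross-validation in the notebook is for.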

### 2.  What was the most challenging part of what you did?
The most challenging part was getting everything to work, because eli5 has not been updated in a while and many of its features no longer work with the most recent version of scikit-learn.
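
One way to catch this kind of mismatch early is to print the installed versions before running anything else. The snippet below is a small sketch using only the standard library (`importlib.metadata`, which assumes Python 3.8 or newer); the helper name `installed_versions` is made up for this example, and the package list is simply the three libraries used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["eli5", "scikit-learn", "xgboost"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output against the versions a tutorial was written for is a quick first check before debugging anything deeper.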


### 3.  What was the most interesting thing you learned?
The most interesting thing was starting to learn how XGBoost works. Less interesting, but still useful: I learned that it is worth spending some time making sure your tools are up to date and compatible before you dive into coding. In that vein, sometimes a newer tool with fewer features is better than an older tool with more features. Debugging routines that should run smoothly is a waste of time.


### 4.  What area would you like to explore with more time?
I would like to learn how to use all the functions of the tool, how to set up an environment with downgraded versions of the tools, and what alternative tools exist for visualizing feature importance. I would also like to explore other machine learning tools, and when a simple moving average might give the same insight as a machine learning model.
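
As one possible starting point for alternative importance tools, permutation importance does not depend on eli5 at all: shuffle one feature's values and see how much accuracy drops. The sketch below is a hand-rolled, plain-Python illustration; `accuracy`, `permutation_importance`, `toy_predict`, and the toy rows are all invented for this example and are not taken from the Titanic data. scikit-learn also provides `sklearn.inspection.permutation_importance` for fitted estimators.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, n_repeats=5, seed=0):
    """Average drop in accuracy when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_values = [row[feature] for row in rows]
        rng.shuffle(shuffled_values)
        # copy each row, replacing only the feature being tested
        shuffled_rows = [dict(row, **{feature: value})
                         for row, value in zip(rows, shuffled_values)]
        drops.append(baseline - accuracy(predict, shuffled_rows, labels))
    return sum(drops) / n_repeats


# Toy rows (invented): the label depends only on "Sex", never on "Parch".
rows = [{"Sex": "female" if i % 2 == 0 else "male", "Parch": i % 3} for i in range(20)]
labels = [1 if row["Sex"] == "female" else 0 for row in rows]


def toy_predict(row):
    # stand-in for a fitted model's prediction function
    return 1 if row["Sex"] == "female" else 0


for feature in ("Sex", "Parch"):
    print(feature, permutation_importance(toy_predict, rows, labels, feature))
```

A feature the model ignores gets an importance of zero, while a feature it relies on shows a clear drop in accuracy when shuffled.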



## Stretch goals and resources

Following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub (and since this is the first assignment of the sprint, open a PR as well).

- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html)
- [matplotlib documentation](https://matplotlib.org/contents.html)
- [Awesome Data Science](https://github.com/bulutyazilim/awesome-datascience) - a list of many types of DS resources

Stretch goals:

- Find and read blogs, walkthroughs, and other examples of people working through cool things with data science - and share with your classmates!
- Write a blog post (Medium is a popular place to publish) introducing yourself as somebody learning data science, and talking about what you've learned already and what you're excited to learn more about