# Lambda School Data Science - A First Look at Data



## Lecture - let's explore Python DS libraries and examples!

The Python Data Science ecosystem is huge. You've seen some of the big pieces - pandas, scikit-learn, matplotlib. What parts do you want to see more of?

In [0]:
# TODO - we'll be doing this live, taking requests
# and demonstrating how to look things up and learn them




## Assignment - now it's your turn

Pick at least one Python DS library and, using its documentation and examples, reproduce something cool in this notebook. It's OK if you don't fully understand it or get it 100% working, but do put in effort and look things up.

In [2]:
import pandas as pd
import random

betting_data = pd.read_csv("https://raw.githubusercontent.com/BuzzFeedNews/2016-01-tennis-betting-analysis/master/data/anonymous_betting_data.csv")

def get_outlier_openings(match_books):
  median = match_books['implied_prob_winner_open'].median()
  return match_books[
      (match_books["implied_prob_winner_open"] - median).abs() > .1
  ]

outlier_openings = betting_data\
  .groupby('match_uid').apply(get_outlier_openings)

selected_betting_data = betting_data[
    ~betting_data["match_book_uid"].isin(outlier_openings["match_book_uid"]) &
    ~betting_data["is_cancelled_or_walkover"]
].copy()

print("The selected data removes {0} matches."\
     .format(betting_data["match_uid"].nunique() - selected_betting_data["match_uid"].nunique()))


The selected data removes 539 matches.


In [3]:
print("there are {0:,} unique matches with odds in the dataset from {1:.0f} to {2:.0f}"\
     .format(selected_betting_data["match_uid"].nunique(), selected_betting_data["year"].min(), selected_betting_data["year"].max()))

there are 25,993 unique matches with odds in the dataset from 2009 to 2015


In [0]:
selected_betting_data["winner_movement"] = selected_betting_data["implied_prob_winner_close"] - selected_betting_data["implied_prob_winner_open"]
selected_betting_data["loser_movement"] = selected_betting_data["implied_prob_loser_close"] - selected_betting_data["implied_prob_loser_open"]
selected_betting_data["abs_winner_movement"] = selected_betting_data["winner_movement"].abs()

In [5]:
high_move_matches = selected_betting_data[(selected_betting_data["abs_winner_movement"] > 0.10)]\
    .sort_values("abs_winner_movement")\
    .drop_duplicates(subset="match_uid")\
    .copy()

print("there was movement greater than 10 percentage points in {0:.2f}% of matches."\
    .format(round(100.0 * len(high_move_matches) / selected_betting_data["match_uid"].nunique(), 2)))


there was movement greater than 10 percentage points in 10.76% of matches.


In [10]:
def find_high_movement_matches_for_player(name):
    high_move_matches = selected_betting_data[
            (((selected_betting_data["winner_movement"] > 0.10) & 
              (selected_betting_data["loser"] == name)) |
             ((selected_betting_data["loser_movement"] > 0.10) &
              (selected_betting_data["winner"] == name)))]\
            .sort_values("abs_winner_movement")\
            .drop_duplicates(subset="match_uid")\
            .copy()
    return pd.Series([name, len(high_move_matches), len(high_move_matches[high_move_matches["loser"] == name])])
  
all_players = pd.DataFrame(selected_betting_data["loser"].unique()).rename(columns={0: "name"})
  
player_high_move_counts = all_players["name"].apply(find_high_movement_matches_for_player)\
  .rename(columns={0: "name", 1: "high_move_matches", 2: "high_move_losses"})
  
selected_players = player_high_move_counts[(player_high_move_counts["high_move_losses"] > 10)].copy()
  
print("there are {0} players with more than 10 losses in high-move matches.".format(len(selected_players)))

there are 39 players with more than 10 losses in high-move matches.


In [0]:
class Player(object):
    def __init__(self, player_name):
        self.name = player_name
        self.matches = self.get_matches()
        self.wins = len(self.matches[self.matches["winner"] == self.name])

    def get_matches(self):
        player_matches = selected_betting_data[
            (((selected_betting_data["winner_movement"] > 0.10) & 
              (selected_betting_data["loser"] == self.name)) |
             ((selected_betting_data["loser_movement"] > 0.10) &
              (selected_betting_data["winner"] == self.name)))]\
            .sort_values("abs_winner_movement", ascending=False )\
            .drop_duplicates(subset="match_uid")\
            .copy()
        player_matches["player_odds_open"] = player_matches\
            .apply(lambda x: x["implied_prob_winner_open"] if x["winner"] == self.name else x["implied_prob_loser_open"],axis=1)
        player_matches["player_odds_close"] = player_matches\
            .apply(lambda x: x["implied_prob_winner_close"] if x["winner"] == self.name else x["implied_prob_loser_close"],axis=1)
        return player_matches

    def sim_once(self, odds_type="open"):
        wins = 0
        for i, m in self.matches.iterrows():
            if m["player_odds_"+odds_type] > random.random():
                wins += 1
        return wins
    
    def sim_x_times(self, x, odds_type="open"):
        return [ self.sim_once(odds_type) for n in range(x) ]
    
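    def pct_sims_with_more_than_x(self, x_times, odds_type="open"):
        # Despite the name, this returns the share of simulations in which the
        # player won as few matches as they actually did, or fewer.
        sims = self.sim_x_times(x_times, odds_type)
        return float(len([x for x in sims if x <= self.wins])) / x_times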
      

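# With 100 simulations the smallest nonzero likelihood is 0.01,
# so a result of 0.0 only means "fewer than 1 in 100".
N_SIMULATIONS = 100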

def get_likelihood(player_name):
    player = Player(player_name)
    return player.pct_sims_with_more_than_x(N_SIMULATIONS, "open")
  
selected_players["likelihood_open"] = selected_players["name"].apply(get_likelihood)

In [19]:
def classify_likelihood(likelihood):
    # With 39 selected players, 0.05 / 39 is about 0.00128, so the .001 branch is never reached.
    if likelihood < (0.05 / len(selected_players)):
        return "****"
    elif likelihood < .001:
        return "***"
    elif likelihood < .01:
        return "**"
    elif likelihood < .05:
        return "*"
    return ""

selected_players["likelihood_level_open"] = selected_players["likelihood_open"].apply(classify_likelihood)

selected_players[
    selected_players["likelihood_open"] < 0.05
].sort_values("likelihood_open")

|  | name | high_move_matches | high_move_losses | likelihood_open | likelihood_level_open |
|---|---|---|---|---|---|
| 0 | 0ffe23c8b80916f6b2c23a52e08018374d68d12f49b261... | 18 | 15 | 0.0 | **** |
| 3 | 79784720fab57e7cc611e07c258cf49f484b9cee01bf47... | 14 | 11 | 0.0 | **** |
| 304 | 573dad2e08250afa99aa704c7ea888b421bcf06bd00aab... | 14 | 13 | 0.0 | **** |
| 58 | f16cc81d239ad735c51cc71442cda44c4d1a9323eb4101... | 16 | 15 | 0.0 | **** |
| 69 | 4f7f8e1b43947b2fb123afb92263b4a863daa87a4de44c... | 19 | 13 | 0.0 | **** |
| 293 | 6702a5de750846f45a3d977f50023c1b20156c61949f2f... | 12 | 12 | 0.0 | **** |
| 251 | 822130a3121c663ea88c6429830f23a794791fed013f6e... | 23 | 18 | 0.0 | **** |
| 82 | 9c92af8ca1b57024bd0a39b73db8be44b25bcde4115549... | 15 | 14 | 0.0 | **** |
| 235 | 33367d214715ab5f5e335cd67dbc90e62983b98e5278a4... | 16 | 15 | 0.0 | **** |
| 86 | 05f3190e5053090035664800d1f52203b40a826cf7f065... | 15 | 13 | 0.01 | * |
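
The strictest level in `classify_likelihood` divides 0.05 by the number of players tested, which is a Bonferroni-style correction for running many tests at once. The sketch below works through that arithmetic with the 39 players and 100 simulations used in this notebook. The helper `min_sims_to_resolve` is a hypothetical name introduced here for illustration, not part of the original analysis.

```python
from fractions import Fraction

ALPHA = Fraction(5, 100)   # overall significance level used in classify_likelihood
N_PLAYERS = 39             # players with more than 10 high-move losses
N_SIMULATIONS = 100        # simulations per player in this notebook

# Per-player threshold after dividing by the number of players tested.
bonferroni_threshold = ALPHA / N_PLAYERS

# Simulated likelihoods can only take the values 0, 1/n, 2/n, ...
resolution = Fraction(1, N_SIMULATIONS)

def min_sims_to_resolve(threshold):
    """Smallest n for which one hit in n simulations (1/n) is below the threshold."""
    return int(1 / threshold) + 1

print(float(bonferroni_threshold))
print(float(resolution))
print(min_sims_to_resolve(bonferroni_threshold))
```

Because one hit in 100 simulations already gives 0.01, which is larger than the corrected threshold, a likelihood of 0.0 here only shows that the outcome occurred in fewer than 1 of 100 simulations. Under these assumptions, more simulations would be needed before a `****` label says much more than that.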


### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.

2.  What was the most challenging part of what you did?

3.  What was the most interesting thing you learned?

4.  What area would you like to explore with more time?




The code above analyzes a dataset of tennis matches, the players in them, and the betting odds for each match. BuzzFeed News used this dataset in a report on the likelihood that matches on the Association of Tennis Professionals (ATP) tour were fixed. After printing some summary information about the data, the last part of the code runs simulations (only 100 in this case, because my computer is slow at running more) to estimate, for each player, how likely it is that they would win so few of their heavily bet-against matches by chance alone, given the opening odds. A very low likelihood flags a player whose matches look "questionable", or possibly fixed.
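
The simulation idea described above can be sketched on its own, without the betting data. This is a minimal illustration, not the notebook's `Player` class: the function names and the example probabilities are assumptions made up for the sketch. Each match is treated as a coin flip weighted by the player's implied win probability, and the result is the share of simulated runs in which the player wins no more matches than they actually did.

```python
import random

def simulate_wins(win_probs, rng):
    """Simulate every match once and return the number of wins."""
    return sum(1 for p in win_probs if rng.random() < p)

def likelihood_of_at_most(win_probs, observed_wins, n_sims, seed=0):
    """Share of simulations with observed_wins or fewer wins."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    hits = sum(
        1 for _ in range(n_sims)
        if simulate_wins(win_probs, rng) <= observed_wins
    )
    return hits / n_sims

# Hypothetical player: 12 matches, each with a 60% implied chance of winning,
# of which only 1 was actually won.
example_probs = [0.6] * 12
likelihood = likelihood_of_at_most(example_probs, observed_wins=1, n_sims=1000)
print(likelihood)
```

A small value means that, if the odds were accurate, winning so few of these matches would rarely happen by chance. It does not by itself show that any match was fixed.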

## Stretch goals and resources

Following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub (and since this is the first assignment of the sprint, open a PR as well).

- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html)
- [matplotlib documentation](https://matplotlib.org/contents.html)
- [Awesome Data Science](https://github.com/bulutyazilim/awesome-datascience) - a list of many types of DS resources

Stretch goals:

- Find and read blogs, walkthroughs, and other examples of people working through cool things with data science - and share with your classmates!
- Write a blog post (Medium is a popular place to publish) introducing yourself as somebody learning data science, and talking about what you've learned already and what you're excited to learn more about.