# Chapter 1

## What is Data Science
Data science is a combination of statistics and computer science. Many data scientists are almost indistinguishable from software engineers, and a sizable number of them are machine learning experts.
In summary, data scientists are people who extract insights from messy data.

## Finding Key Connectors
Let's say you are given a dataset and asked to identify the "key connectors" in it.

The data consists of a list of users, each with an `id` number and `name`. There is also another dataset consisting of friendship pairings by id number.

In [13]:
users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
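Before building on this data, it can help to confirm two things the later steps quietly rely on: every id in `friendships` belongs to a known user, and each user's position in the list equals their `id`. The sketch below is only an illustration; the names `known_ids`, `unknown_ids`, and `positions_match_ids` are hypothetical and not part of the original notes.

```python
# Same data as above, with the user dicts built compactly from the names.
names = ["Hero", "Dunn", "Sue", "Chi", "Thor",
         "Clive", "Hicks", "Devin", "Kate", "Klein"]
users = [{"id": i, "name": name} for i, name in enumerate(names)]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Hypothetical check names (not part of the original notes).
known_ids = {user["id"] for user in users}
unknown_ids = {user_id for pair in friendships for user_id in pair} - known_ids
positions_match_ids = all(index == user["id"] for index, user in enumerate(users))

print(unknown_ids)          # set()
print(positions_match_ids)  # True
```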

Since these users are represented as `dict`s, it is easy to augment them with new data.

If we want to add a list of friends for each user, we can first set each user's `friends` key to an empty list:

In [52]:
for user in users:
    user["friends"] = []
    print(user)

{'id': 0, 'name': 'Hero', 'friends': []}
{'id': 1, 'name': 'Dunn', 'friends': []}
{'id': 2, 'name': 'Sue', 'friends': []}
{'id': 3, 'name': 'Chi', 'friends': []}
{'id': 4, 'name': 'Thor', 'friends': []}
{'id': 5, 'name': 'Clive', 'friends': []}
{'id': 6, 'name': 'Hicks', 'friends': []}
{'id': 7, 'name': 'Devin', 'friends': []}
{'id': 8, 'name': 'Kate', 'friends': []}
{'id': 9, 'name': 'Klein', 'friends': []}


Now this list can be populated using the `friendships` data:

In [53]:
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j]["name"]) # add j as a friend of i
    users[j]["friends"].append(users[i]["name"]) # add i as a friend of j

In [54]:
for user in users:
    print(user)

{'id': 0, 'name': 'Hero', 'friends': ['Dunn', 'Sue']}
{'id': 1, 'name': 'Dunn', 'friends': ['Hero', 'Sue', 'Chi']}
{'id': 2, 'name': 'Sue', 'friends': ['Hero', 'Dunn', 'Chi']}
{'id': 3, 'name': 'Chi', 'friends': ['Dunn', 'Sue', 'Thor']}
{'id': 4, 'name': 'Thor', 'friends': ['Chi', 'Clive']}
{'id': 5, 'name': 'Clive', 'friends': ['Thor', 'Hicks', 'Devin']}
{'id': 6, 'name': 'Hicks', 'friends': ['Clive', 'Kate']}
{'id': 7, 'name': 'Devin', 'friends': ['Clive', 'Kate']}
{'id': 8, 'name': 'Kate', 'friends': ['Hicks', 'Devin', 'Klein']}
{'id': 9, 'name': 'Klein', 'friends': ['Kate']}
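The loop above indexes `users` by position, which only works while each user's list position equals their `id`. As a hedged alternative, the sketch below keys everything by `id` instead, so it would still work if the list were reordered. The names `name_by_id` and `friends_by_id` are assumptions introduced for illustration.

```python
from collections import defaultdict

names = ["Hero", "Dunn", "Sue", "Chi", "Thor",
         "Clive", "Hicks", "Devin", "Kate", "Klein"]
users = [{"id": i, "name": name} for i, name in enumerate(names)]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Hypothetical lookup tables keyed by user id rather than list position.
name_by_id = {user["id"]: user["name"] for user in users}
friends_by_id = defaultdict(list)

for i, j in friendships:
    friends_by_id[i].append(name_by_id[j])  # j is a friend of i
    friends_by_id[j].append(name_by_id[i])  # i is a friend of j

print(friends_by_id[0])  # ['Dunn', 'Sue']
print(friends_by_id[9])  # ['Kate']
```

A `defaultdict(list)` creates the empty list on first access, so the separate initialization pass is not needed in this variant.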


With each user `dict` containing its own list of friends, we can extract more information about the friendship graph.

First, find the total number of connections by summing the lengths of all the `friends` lists:

In [67]:
def number_of_friends(user_id):
    """How many friends does the user with this id have?"""
    return len(users[user_id]["friends"]) # length of the friends list

total_connections = sum(number_of_friends(user["id"]) for user in users) # 24
print(total_connections)

24
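The total of 24 is exactly twice the 12 friendship pairs, because each pair adds one entry to two different `friends` lists. A minimal sketch of that sanity check, counting directly from the pairs (the name `degree` is an assumption, not from the original notes):

```python
from collections import Counter

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Count how many times each user id appears across all pairs.
degree = Counter(user_id for pair in friendships for user_id in pair)
total_connections = sum(degree.values())

print(total_connections)                          # 24
print(total_connections == 2 * len(friendships))  # True
```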


In [79]:
for i, user in enumerate(users):
    print(number_of_friends(i))

2
3
3
3
2
3
2
2
3
1


Divide by the number of users to find the average number of connections:

In [65]:
from __future__ import division # only needed on Python 2; on Python 3, / is already true division
num_users = len(users) # length of the users list
avg_connections = total_connections / num_users # 2.4

In [66]:
print(num_users)
print(avg_connections)

10
2.4
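To make the division step concrete, the sketch below contrasts true division (`/`) with floor division (`//`) on Python 3 and cross-checks the result with `statistics.mean` from the standard library. The list `friend_counts` is an illustrative name holding the per-user counts printed earlier.

```python
from statistics import mean

# Friend counts per user, copied from the output above.
friend_counts = [2, 3, 3, 3, 2, 3, 2, 2, 3, 1]

total_connections = sum(friend_counts)  # 24
num_users = len(friend_counts)          # 10

true_average = total_connections / num_users      # true division
floored_average = total_connections // num_users  # floor division drops the fraction

print(true_average)         # 2.4
print(floored_average)      # 2
print(mean(friend_counts))  # 2.4
```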


Another useful statistic is who the most connected people are (i.e., those with the largest number of friends). Because the dataset is small, it can be sorted from "most friends" to "least friends":

In [71]:
# create a list of (user_id, number_of_friends) pairs
num_friends_by_id = [(user["id"], number_of_friends(user["id"]))
                     for user in users]

# Python 3 no longer allows tuple unpacking in lambda parameters,
# so index into the pair instead; sorted() returns a new list
sorted_num_friends = sorted(num_friends_by_id,
                            key=lambda pair: pair[1], # by num_friends
                            reverse=True) # largest to smallest


# each pair is (user_id, num_friends)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
# (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]

In [72]:
print(num_friends_by_id)

[(0, 2), (1, 3), (2, 3), (3, 3), (4, 2), (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]
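The printed list is still in id order, since `sorted` returns a new list rather than reordering its input. As a hedged sketch of the ranking step, the code below sorts the pairs by friend count and maps the top ids back to names. The names `ranked`, `top_count`, and `most_connected` are assumptions introduced for illustration.

```python
names = ["Hero", "Dunn", "Sue", "Chi", "Thor",
         "Clive", "Hicks", "Devin", "Kate", "Klein"]

# (user_id, num_friends) pairs, copied from the output above.
num_friends_by_id = [(0, 2), (1, 3), (2, 3), (3, 3), (4, 2),
                     (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]

# Sort by the second element of each pair, largest first.
# sorted() is stable, so ties keep their original id order.
ranked = sorted(num_friends_by_id, key=lambda pair: pair[1], reverse=True)

top_count = ranked[0][1]
most_connected = [names[user_id] for user_id, count in ranked if count == top_count]

print(ranked[:5])      # [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3)]
print(most_connected)  # ['Dunn', 'Sue', 'Chi', 'Clive', 'Kate']
```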
