# An Introduction to Python: Part 2 -- <font color=red>SOLUTION</font>
### with a Machine Learning Context

Recall the problem from `Part 1`, 

## The Problem

* Domain: Crime
    * Dataset: history of crimes
        * $(X, y)$
        * $X$ features crime/person
        * $y$ target violent yes/no
    * <font color=red>Goal: risk of violence</font>
        * classify a new suspect into "VIOLENT" or "NONVIOLENT"
        * probability of being violent
* Known Features $X$
    * Criminal Record
        * $X_0$ number of crimes
    * Personal Details
        * $X_1$ Area
        * $X_2$ Age
        * $X_3$ In Care System? 
* Unknown Target $y$
    * $y$ is whether you are a violent offender

With the datasets,

In [4]:
X = [
    (3, "London", 30, False),
    (4, "London", 20, True),
    (0, "Manchester", 18, True),
    (1, "Manchester", None, False), # could be a list [,,,,]
    (2, "Manchester", 50, False),
]

In [5]:
y = [
    True, 
    True, 
    False, 
    False, 
    False
]

In [6]:
len(y) == len(X)

True

### The Solution

* Score each $X$ in terms of how "violent" it is
    * add up all the scores
* We compare this total score to some cutoff (we can choose)
    * if you cross the threshold, you're violent

We define `suspect` to be the first entry in `X` for convenience in illustrating this process of scoring,

In [7]:
suspect = X[0]
suspect

(3, 'London', 30, False)

`score` is a *dictionary*, which is a key-value data structure. A KV data structure tags values by "keys", typically strings.

In [8]:
score = {
    'crimes' : 4, # weight
    'age'    : 3,
    'in_care': 10
}

**Q. Define a `nonviolent_score` dictionary with alternative entries.**

In [9]:
nonviolent_score = {
    'crimes': 3,
    'age': 2, 
    'in_care' : 0
}

You can find the value if you know the key,

In [10]:
score['crimes'] # index = key

4

**Q. `print()` all the entries of your nonviolent dictionary.**

In [11]:
print(nonviolent_score['crimes'])
print(nonviolent_score['age'])
print(nonviolent_score['in_care'])

3
2
0


Note the index isn't a position, it's a string *key*.
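
To make the key-versus-position point concrete, here is a small sketch. The name `weights` is only illustrative (its values are copied from `score` above); it loops over every key-value pair with `.items()` and shows one way to handle a key that is not present.

```python
# Illustrative copy of the weights used in the text.
weights = {'crimes': 4, 'age': 3, 'in_care': 10}

# .items() yields each key together with its value.
for key, value in weights.items():
    print(key, value)

# weights['location'] would raise a KeyError, because no such key exists.
# .get() returns a default value instead of raising.
missing = weights.get('location', 0)
print(missing)  # 0
```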

A score is computed from weights multiplied by each feature of an observation, $X$, 

In [12]:
total_score = (
    score['crimes']  * suspect[0] + 
    score['age']     * suspect[2] + 
    score['in_care'] * suspect[3]
)

We look up the weights in `score` and multiply by the entries in a particular $X$, ie., the `suspect`.

**Q. Calculate a `total_nonviolent_score` using your nonviolent dictionary and `suspect`.**

In [13]:
total_nonviolent_score = (
    nonviolent_score['crimes']  * suspect[0] + 
    nonviolent_score['age']     * suspect[2] + 
    nonviolent_score['in_care'] * suspect[3]
)

In [14]:
total_nonviolent_score

69
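
Since the same weighted sum is written out twice, it could be wrapped in a function. This is only a sketch: `weighted_score` is a hypothetical helper name, and the data is copied from the cells above.

```python
def weighted_score(weights, suspect):
    """Weighted sum over number of crimes, age and the in-care flag."""
    return (
        weights['crimes']  * suspect[0] +
        weights['age']     * suspect[2] +
        weights['in_care'] * suspect[3]
    )

suspect = (3, "London", 30, False)
score = {'crimes': 4, 'age': 3, 'in_care': 10}
nonviolent_score = {'crimes': 3, 'age': 2, 'in_care': 0}

print(weighted_score(score, suspect))             # 102
print(weighted_score(nonviolent_score, suspect))  # 69
```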

---

In Python, a calculation cannot simply run across multiple lines, because the end of a line normally terminates the statement (ends the action).

If you want to run across multiple lines you can use parentheses,

In [19]:
1 + 
1

SyntaxError: invalid syntax (<ipython-input-19-57f8c229afcc>, line 1)

In [20]:
(1 + 
1)

2

In Python, line breaks (and the indentation that follows them) are ignored between any pair of brackets: `()`, `[]`, `{}`.
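
A short sketch of the same rule with each kind of bracket; the variable names are illustrative only.

```python
# Parentheses: an expression split over several lines.
total = (
    1 +
    2 +
    3
)

# Square brackets: a list with one item per line.
ages = [
    30,
    20,
    18,
]

# Curly braces: a dictionary with one entry per line.
weights = {
    'crimes': 4,
    'age': 3,
}

print(total, len(ages), len(weights))  # 6 3 2
```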

---

We now have a total score,

In [22]:
total_score

102

We compare this to some chosen cutoff to *predict*  if they are violent; ie., we *classify* them,

In [23]:
cutoff = 100

if total_score > cutoff:
    print("VIOLENT")
else:
    print("NONVIOLENT")

VIOLENT


**Q. Create an `if-else` decision which judges nonviolence; ie., use your `nonviolent_score` and your own `nonviolent_cutoff`.**

In [15]:
nonviolent_cutoff = 100

if total_nonviolent_score > nonviolent_cutoff:
    print("NONVIOLENT")
else:
    print("VIOLENT")

VIOLENT
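
The comparison against a cutoff can also be written as a small function that returns the label instead of printing it. This is a sketch under the assumption that a default cutoff of `100` is wanted; `classify` is a hypothetical name.

```python
def classify(total, cutoff=100):
    """Return a label by comparing a total score with a cutoff."""
    if total > cutoff:
        return "VIOLENT"
    return "NONVIOLENT"

print(classify(102))              # VIOLENT
print(classify(102, cutoff=110))  # NONVIOLENT
```

Returning the label makes it easy to store predictions in a list rather than only displaying them.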


### Improving the Weights


The chosen weights above were just based on intuition, eg., that crimes are "`4` important".

Let's compute them from the data.

In [29]:
y

[True, True, False, False, False]

In [28]:
X

[(3, 'London', 30, False),
 (4, 'London', 20, True),
 (0, 'Manchester', 18, True),
 (1, 'Manchester', None, False),
 (2, 'Manchester', 50, False)]

**Q. What locations do all violent criminals belong to?**

"London"

### How many crimes *are associated with* violence?

Notice in `y` the first two rows are "violent", so the number of crimes "associated with violence" is `mean([3, 4])`.

We require knowing $y$ as we consider $X$ -- we only want to inspect $X$ entries if they're violent.

We can use `zip()`.

----

#### Streaming Operations

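The functions below (`range()` etc.) *compute values*, **on-demand**. Simply: when you loop over them, they give a value.

This is efficient in that you don't pre-store any values. Loosely, these are called generators ("generating"); strictly speaking, `range()` returns a lazy sequence and `zip()` and `enumerate()` return iterators, but all of them produce values only when asked.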

In [30]:
range(0, 10)

range(0, 10)

In [125]:
for i in range(0, 3):
    print(i, "Ho")

0 Ho
1 Ho
2 Ho


**Q. `print()` the numbers in the `range(-5, 5)` using a `for`-loop.**

In [17]:
for i in range(-5, 5):
    print(i)

-5
-4
-3
-2
-1
0
1
2
3
4


Using `list` with a generator shows all the elements at once,

In [42]:
list(range(0, 3))

[0, 1, 2]

**Q. `list()` the numbers in the `range(-5, 5)`.**

In [18]:
list(range(-5, 5))

[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]

`range()` is often used when looping for a fixed number of iterations...

Eg., consider simulating a dataset and generating 10 points, 

In [48]:
new = []
for x in range(0, 10):
    new.append( 2 * x + 1 )
new

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

##### Aside: More Generators

Other generators include `zip`, and `enumerate`, 

In [126]:
zip(X, y) # we consider below

<zip at 0x7ff3783b8140>

`enumerate` computes an index and draws an element from an original dataset,

In [127]:
enumerate(X)

<enumerate at 0x7ff3783c4cc0>

In [49]:
for index, entry in enumerate("ABC"):
    print(index, entry)

0 A
1 B
2 C
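
As a sketch of `enumerate` on the crime data, the loop below numbers the first two rows of `X` (copied here so the block runs on its own). The `start=1` argument makes the count begin at one rather than zero.

```python
X = [
    (3, "London", 30, False),
    (4, "London", 20, True),
]

# Each step gives a row number and the suspect tuple at that position.
for row_number, suspect in enumerate(X, start=1):
    print(row_number, suspect[1], suspect[0])

rows = list(enumerate(X, start=1))
```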


`zip()` takes two-or-more collections and provides, on demand, an element from each,

In [37]:
for a, b, c in zip("ABC", [1,2,3], (True, True, None)):
    print(a, b, c)

A 1 True
B 2 True
C 3 None


In [38]:
letters = "ABC"
numbers = [1,2,3]
labels = (True, True, None)

abc = zip(letters, numbers, labels)

for a, b, c in abc:
    print(a, b, c)

A 1 True
B 2 True
C 3 None


**Q. Define a list of `colors` and a list of `numbers`. Using `zip()` and `for`, `print()` a color and a number from each list.**

In [19]:
colors = ["R", "G", "B"]
numbers = [2, 4, 5]

for c, n in zip(colors, numbers):
    print(c, n)

R 2
G 4
B 5


**Q. Define a list of `suspect_names` and a list of `crime_locations`. Using `zip()` and `for`, `print()` one from each.**

In [21]:
suspect_names = ["Alice", "Eve"]
crime_locations = ["Liverpool", "Glasgow"]

for name, location in zip(suspect_names, crime_locations):
    print(name, location)

Alice Liverpool
Eve Glasgow


##### Aside: Understanding `zip`

If you use `list()` to convert a generator to a list, it will force the generator to compute all entries (ie., generate them)...

In [39]:
list(zip(letters, numbers, labels))

[('A', 1, True), ('B', 2, True), ('C', 3, None)]

You can see what it's generating. A new list of tuples of three elements, each entry coming from the original.

In [41]:
[(letters[0], numbers[0], labels[0]), 
 (letters[1], numbers[1], labels[1]), 
]

[('A', 1, True), ('B', 2, True)]

Understanding this is not essential for using `zip`. 
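
One practical consequence of on-demand computation is that a `zip` object can only be consumed once, whereas a `range` can be looped over repeatedly. A minimal sketch, with illustrative variable names:

```python
letters = "ABC"
numbers = [1, 2, 3]

pairs = zip(letters, numbers)
first_pass = list(pairs)   # consumes the zip object
second_pass = list(pairs)  # nothing is left to generate

print(first_pass)   # [('A', 1), ('B', 2), ('C', 3)]
print(second_pass)  # []

# range() is different: it can be looped over again and again.
r = range(0, 3)
print(list(r), list(r))  # [0, 1, 2] [0, 1, 2]
```

If the pairs are needed more than once, either call `zip()` again or keep the `list()` result.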

---

#### Solving the Problem

Recall we are aiming for realistic weights for `score`, 

In [51]:
y

[True, True, False, False, False]

In [50]:
X

[(3, 'London', 30, False),
 (4, 'London', 20, True),
 (0, 'Manchester', 18, True),
 (1, 'Manchester', None, False),
 (2, 'Manchester', 50, False)]

In [52]:
score

{'crimes': 4, 'age': 3, 'in_care': 10}

The idea:

* consider every entry in $X$ and $y$
    * if $y$ is True, ie., the suspect is violent
    * then keep their number of crimes ( the zero entry in the suspect)
        * keep = append to an empty list

In [22]:
vcrimes = [] # empty list

for suspect, is_violent in zip(X, y):
    if is_violent: 
        vcrimes.append(suspect[0]) # keep

**Q. Define another empty list called `vlocations` before the loop, and `.append` the suspect's location (ie., `suspect[1]`).**

In [25]:
vcrimes = [] # empty list
vlocations = []

for suspect, is_violent in zip(X, y):
    if is_violent: 
        vcrimes.append(suspect[0]) # keep
        vlocations.append(suspect[1])

In [26]:
vcrimes # compare with X and y above, it's the first two entries

[3, 4]

**Q. `print()` your `vlocations`.**

In [27]:
print(vlocations)

['London', 'London']
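
The "empty list, loop, `if`, `.append`" pattern can also be written as a list comprehension. This is an alternative sketch rather than the approach used in this notebook; the data is copied from above so the block runs on its own.

```python
X = [
    (3, "London", 30, False),
    (4, "London", 20, True),
    (0, "Manchester", 18, True),
    (1, "Manchester", None, False),
    (2, "Manchester", 50, False),
]
y = [True, True, False, False, False]

# Keep a suspect's value only when the matching y entry is True.
vcrimes = [suspect[0] for suspect, is_violent in zip(X, y) if is_violent]
vlocations = [suspect[1] for suspect, is_violent in zip(X, y) if is_violent]

print(vcrimes)     # [3, 4]
print(vlocations)  # ['London', 'London']
```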


We will take the `mean` of the violent crimes as the weight.

We can use `sum, len` or **import** a `mean` function,

In [29]:
# mean = ...statistics... # ie., a variable definition
from statistics import mean

The statistics library (module) in python contains various statistical functions.

`from library import abc` loads the library (runs some Python files) and makes available the things you name, here `abc`.

In [60]:
mean(vcrimes)

3.5
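
For comparison, here is the `sum`/`len` route mentioned above, checked against the imported `mean`. The name `manual_mean` is illustrative.

```python
from statistics import mean

vcrimes = [3, 4]

# The mean is the total divided by the number of values.
manual_mean = sum(vcrimes) / len(vcrimes)

print(manual_mean)                   # 3.5
print(manual_mean == mean(vcrimes))  # True
```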

**Q. Show the `mean()` of the `range()` `-100` to `100`.**

In [32]:
list(range(-10, 10))

[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [31]:
mean(range(-100, 100))

-0.5

...Why?

In [34]:
(-100 + 99)/2

-0.5

**Q. What is `mean(range(0, 1_000, 100))`, why?**

In [41]:
mean(range(0, 1_000, 100)) # the underscore is optional, it's just ignored!

450

In [42]:
list(range(0, 1_000, 100))

[0, 100, 200, 300, 400, 500, 600, 700, 800, 900]

In [43]:
(0 + 900)/2

450.0
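
The shortcut works because the values in a `range()` are evenly spaced, so their mean sits halfway between the first and last value. A small check over a few ranges (the third range is an extra illustrative example):

```python
from statistics import mean

ranges = [range(-100, 100), range(0, 1_000, 100), range(1, 8, 2)]

for r in ranges:
    # r[0] is the first value and r[-1] the last value actually produced.
    midpoint = (r[0] + r[-1]) / 2
    print(midpoint, mean(r), midpoint == mean(r))
```

Note that the last value is `r[-1]`, not the stop argument, which is why `range(-100, 100)` gives `-0.5` rather than `0`.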

We update `crimes` in `score` to be this data-evidenced one,

In [62]:
score['crimes'] = mean(vcrimes)

In [64]:
score

{'crimes': 3.5, 'age': 3, 'in_care': 10}

### Making the Scores Realistic

The remaining task is to compute a weight for every $X$ entry which is derived from the data. 

* `mean(violent crimes)`
* `mode(violent locations)`
* `mean(violent ages)`
* `mean(violent in_cares)`

To compute these means we will create a new dataset called `violents` which keeps only violent suspects. 

In [59]:
violents = {
    'crimes': [],
    'locations': [],
    'ages': [],
    'in_cares': []
}

`violents` is a *dictionary* which holds four separate lists. Each will contain just the violent suspect data for each feature (type of $X$). 

In [60]:
for suspect, is_violent in zip(X, y):
    if is_violent:
        violents['crimes'].append(suspect[0]) # number of crimes
        violents['locations'].append(suspect[1]) # location
        violents['ages'].append(suspect[2]) # age
        violents['in_cares'].append(suspect[3]) # in_care?

**Q. Modify the loop above, `print()` the `suspect`'s details.**

In [61]:
for suspect, is_violent in zip(X, y):
    if is_violent:
        print("VIOLENT")
        print(suspect)
        violents['crimes'].append(suspect[0]) # number of crimes
        violents['locations'].append(suspect[1]) # location
        violents['ages'].append(suspect[2]) # age
        violents['in_cares'].append(suspect[3]) # in_care?
    else: # extra
        print("NONVIOLENT")
        print(suspect)

VIOLENT
(3, 'London', 30, False)
VIOLENT
(4, 'London', 20, True)
NONVIOLENT
(0, 'Manchester', 18, True)
NONVIOLENT
(1, 'Manchester', None, False)
NONVIOLENT
(2, 'Manchester', 50, False)


---

Aside: recall that in a loop the variable being defined, here eg., `suspect`, is just each entry in `X` in turn,

In [62]:
suspect = X[0]
suspect[0] # the number of crimes of the suspect

3

`suspect[0]` is the first entry *in* the first entry of `X`, ie., the number of crimes.

In [63]:
X[0]

(3, 'London', 30, False)

---

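`violents` now contains only data from violent suspects. Each value appears twice because the loop was run in two cells above and `.append` added to the same lists both times; this leaves the means and the mode below unchanged, but it does change the standard deviation. Here is `violents`,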

In [64]:
violents

{'crimes': [3, 4, 3, 4],
 'locations': ['London', 'London', 'London', 'London'],
 'ages': [30, 20, 30, 20],
 'in_cares': [False, True, False, True]}
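
One way to avoid repeated entries when a cell is re-run is to build the lists inside a function, so they start empty every time. A sketch, where `collect_violents` is a hypothetical helper name and the data is copied from above:

```python
X = [
    (3, "London", 30, False),
    (4, "London", 20, True),
    (0, "Manchester", 18, True),
    (1, "Manchester", None, False),
    (2, "Manchester", 50, False),
]
y = [True, True, False, False, False]

def collect_violents(X, y):
    """Build the per-feature lists from scratch on every call."""
    violents = {'crimes': [], 'locations': [], 'ages': [], 'in_cares': []}
    for suspect, is_violent in zip(X, y):
        if is_violent:
            violents['crimes'].append(suspect[0])
            violents['locations'].append(suspect[1])
            violents['ages'].append(suspect[2])
            violents['in_cares'].append(suspect[3])
    return violents

violents = collect_violents(X, y)
violents = collect_violents(X, y)  # calling again does not duplicate entries
print(violents['crimes'])  # [3, 4]
```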

In [65]:
from statistics import mode

These are still not good weights: the `mode()` is text, and the `mean()`s are all on different scales.

In [66]:
print(mean(violents['crimes']))
print(mode(violents['locations']))
print(mean(violents['ages']))
print(mean(violents['in_cares']))


3.5
London
25
0.5


In [69]:
violents['crimes']

[3, 4, 3, 4]

**Q. Import the `stdev` function from `statistics` and `print()` the standard deviation of the violent suspect's crimes, ages and in_cares.**

In [70]:
from statistics import stdev

stdev(violents['crimes'])

0.5773502691896257
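
To see what `stdev` is doing, the sketch below computes the sample standard deviation by hand: the squared gaps from the mean are summed, divided by `n - 1`, and square-rooted. It uses the same four values shown above and compares the result with `stdev` using `math.isclose`.

```python
from math import isclose, sqrt
from statistics import mean, stdev

crimes = [3, 4, 3, 4]

m = mean(crimes)
squared_gaps = [(c - m) ** 2 for c in crimes]

# Sample standard deviation divides by n - 1, not n.
manual_stdev = sqrt(sum(squared_gaps) / (len(crimes) - 1))

print(isclose(manual_stdev, stdev(crimes)))  # True
```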

What would a uniform score be?

In [97]:
X

[(3, 'London', 30, False),
 (4, 'London', 20, True),
 (0, 'Manchester', 18, True),
 (1, 'Manchester', None, False),
 (2, 'Manchester', 50, False)]

Let's rescale them,

* Crimes are on the scale, approx. `0 to 10`
* Ages, 0 to 100
    * /10  puts it on `0 to 10`
* In Care just 0, 1
    * `*10` puts it on `0 to 10`

In [98]:
print(mean(violents['crimes']))
print(mode(violents['locations']))
print(mean(violents['ages'])/10)
print(10 * mean(violents['in_cares']))


3.5
London
2.5
5.0


We can't multiply by a location...

Let's compare the suspect location to the `mode()` of all violent-suspect locations,

In [100]:
"Manchester" == mode(violents['locations'])

False

Since `False` is `0` and `True` is `1`, we can just multiply by `10`,

In [105]:
10 * ("Manchester" == mode(violents['locations']))

0

In [102]:
10 * ("London" == mode(violents['locations']))

10
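
A few quick checks of how booleans behave as numbers, as a sketch with illustrative variable names:

```python
# In Python, bool is a kind of integer: True acts as 1 and False as 0.
same_as_ints = (True == 1) and (False == 0)
two = True + True
location_points = 10 * ("London" == "London")
how_many_true = sum([False, True, False, True])

print(same_as_ints)     # True
print(two)              # 2
print(location_points)  # 10
print(how_many_true)    # 2
```

This is also why `mean()` of a list of booleans gives the fraction that are `True`.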

##### The Final Score

In [72]:
score = {
    'crime'   : mean(violents['crimes']),
    'location': mode(violents['locations']),
    'age'     : mean(violents['ages'])/10,
    'in_care' : 10 * mean(violents['in_cares'])
}


Let's use this to compute a score for a new suspect,

In [73]:
new_suspect = (2, "Manchester", 23, True)

In [74]:
new_suspect_score = ( 
    new_suspect[0] * score['crime'] +
    new_suspect[2] * score['age'] + 
    new_suspect[3] * score['in_care'] + 
    
    10 * ( new_suspect[1] == score['location'])
)

**Q. `print()` each part of the score for this `new_suspect` ie., their crimes score, their ages score, etc.**

In [77]:
print("SCORES")
print("Crime",    new_suspect[0] * score['crime'] )
print("Age",    new_suspect[2] * score['age']  )
print("Care",    new_suspect[3] * score['in_care']  )
    
print("Location",    10 * ( new_suspect[1] == score['location']) )


SCORES
Crime 7.0
Age 57.5
Care 5.0
Location 0


In [120]:
new_suspect_score

69.5

We predict they are violent if their score exceeds some cutoff,

In [118]:
cutoff = 100 # we can derive this from the data + judgement

if new_suspect_score > cutoff:
    print("VIOLENT")
else:
    print("NON_VIOLENT")

NON_VIOLENT


Aside: we can raise or lower the cutoff to make the classifier less or more sensitive to the risk of violence.
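
To illustrate the effect of the cutoff, the sketch below wraps the final score in a function and classifies the same suspect under two cutoffs. `score_suspect` and `location_bonus` are hypothetical names, the weight values are copied from the rescaled output above, and the cutoff of `50` is an arbitrary illustrative choice.

```python
def score_suspect(suspect, weights, location_bonus=10):
    """Total score: weighted features plus a bonus for a matching location."""
    return (
        suspect[0] * weights['crime'] +
        suspect[2] * weights['age'] +
        suspect[3] * weights['in_care'] +
        location_bonus * (suspect[1] == weights['location'])
    )

# Values copied from the rescaled means and mode printed earlier.
weights = {'crime': 3.5, 'location': 'London', 'age': 2.5, 'in_care': 5.0}
new_suspect = (2, "Manchester", 23, True)

total = score_suspect(new_suspect, weights)
print(total)  # 69.5

labels = {}
for cutoff in (50, 100):
    labels[cutoff] = "VIOLENT" if total > cutoff else "NON_VIOLENT"
    print(cutoff, labels[cutoff])
```

With the lower cutoff the same suspect is flagged, which shows how much the prediction depends on this one choice.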

**Q. Reflect on this problem. We started with $X$ features of crimes and $y$, whether the offender was violent. We used averages over $X$ partitioned by $y$ to compute a score -- and used this score to classify whether a new suspect was violent. What are the benefits/flaws of such an approach?**

#### Benefit: Informative

We use knowable information which *does* seem to be related to whether people are violent. We are able to capture some measure which helps us make a decision.

#### Problem: Irrelevant

This measure *isn't* genuinely informative if $X$ is "irrelevant" to $y$ (eg., non-causal) -- we need to know whether the patterns measured here are accidental or meaningful.


#### Problem: Ethically Wrong & Irrelevant

Eg., it is likely that locations are correlated with, eg., areas of *over-policing*, so it is unclear whether location data accurately captures crime rates.

#### Problem: Encoding & Care

The formula for care converts a boolean (0, 1) into a number by multiplying by 10, 

In [82]:
suspect = X[0]
suspect

(3, 'London', 30, False)

In [83]:
10 * suspect[-1]

0

Any means of representing a category (care) as a number is always somewhat arbitrary. But... we could investigate better ways to do it!

Consider here, eg., being in care has a "0 or BIG" boost which can have a disproportionate impact on offenders near the threshold. 

#### Problem: Population Size & Choice

* We require more information/data to improve score
* More features/data 

---

## Stretch Questions

Learners with an existing background in this area...

1. Create a new notebook.
2. Choose a different problem domain.
3. Consider a dataset of features $X$, targets $y$.
4. Produce a score-based classification system for your own dataset. 

Eg. Health, Sleep Quality, X = ...(HR, BP, Hours), y = [GOOD, BAD, OK]... .



EXTRA: revise the scoring system to use `Counter` (see below). 

### Appendix


There is a `Counter` utility in Python which will process a dataset and produce a dictionary whose keys are the *unique* elements of your dataset and whose values are their counts.

You could use it to score the location rather than using just the mode.


In [121]:
from collections import Counter

In [122]:
Counter(["London", "London", "Manchester", "Liverpool"])

Counter({'London': 2, 'Manchester': 1, 'Liverpool': 1})
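
As one possible sketch of the EXTRA task, a location could be scored by the share of violent suspects recorded there, scaled to `0 to 10`. The list of locations is the illustrative one from the cell above, and `location_score` and `scale` are hypothetical names, not part of the notebook's solution.

```python
from collections import Counter

# Illustrative list of locations for violent suspects.
violent_locations = ["London", "London", "Manchester", "Liverpool"]

counts = Counter(violent_locations)
total = sum(counts.values())

def location_score(location, scale=10):
    """Share of violent suspects at this location, scaled to 0..scale."""
    # A Counter returns 0 for a key it has never seen.
    return scale * counts[location] / total

print(location_score("London"))      # 5.0
print(location_score("Manchester"))  # 2.5
print(location_score("Glasgow"))     # 0.0
```

Unlike the mode, this gives partial credit to locations that appear less often, though it inherits the same over-policing concern discussed above.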