# Fundamentals of Data Analysis Tasks

**Declan Fox**

## Task 1

>The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive
integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...
>$$f(x) =
\begin{cases}
x \div 2, & \text{if } x \text{ is even} \\
3x + 1, & \text{otherwise}
\end{cases}$$
>For example, starting with the value 10, which is an even number,
we divide it by 2 to get 5. Then 5 is an odd number, so we multiply by 3 and add 1 to get 16. Then we repeatedly divide by 2 to
get 8, 4, 2, 1. Once we are at 1, we go back to 4 and get stuck in the
repeating sequence 4, 2, 1 as we suspected.
This task is to verify, using Python, that the conjecture is true for
the first 10,000 positive integers.


In [1]:
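def f(x):
    """Apply one step of the Collatz function to the integer x."""
    # if x is even, divide by 2 (integer division keeps the result an int)
    if x % 2 == 0:
        return x // 2
    # otherwise x is odd: multiply by 3 and add 1
    else:
        return (3 * x) + 1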

The following function returns the list of integers produced by repeatedly applying $f$ to a given integer $x$ until the value 1 is reached. The starting value itself is not included in the list.

In [2]:
def collatz(x):
    ints = []
    while x != 1:
        x = f(x)
        ints.append(x)
    return ints

In [3]:
for i in range(1, 5):
    print(f'Output for positive integer {i}:')
    print(collatz(i))

Output for positive integer 1:
[]
Output for positive integer 2:
[1]
Output for positive integer 3:
[10, 5, 16, 8, 4, 2, 1]
Output for positive integer 4:
[2, 1]


Integers 1 to 4 have been tested above and uphold the Collatz conjecture. For any integer $> 4$, we assume that the conjecture is true for that integer if the list of integers it produces ends in $4, 2, 1$.

In [4]:
# test numbers 5 to 10000 (1 to 4 were checked above)
latest_num = 4
for i in range(5, 10001):
    # check last 3 numbers
    if collatz(i)[-3:] == [4, 2, 1]:
        latest_num = i
    else:
        break  # exit loop if last 3 numbers of list are not 4, 2 and 1
print(f'numbers 1 to {latest_num}: are Collatz Compliant')

numbers 1 to 10000: are Collatz Compliant
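
Because `collatz` only returns once it reaches 1, a starting value that never reached 1 would make the loop run forever instead of failing the check. The following is a minimal sketch of a bounded alternative. The helper name `reaches_one` and the cap `max_steps` are assumptions made for illustration and are not part of the original notebook; the cap is an arbitrary limit, not a proven bound.

```python
def f(x):
    """One Collatz step, same rule as the f defined in In [1]."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def reaches_one(x, max_steps=10_000):
    """Return True if x reaches 1 within max_steps applications of f.

    max_steps is a hypothetical safety cap so the loop always terminates.
    """
    for _ in range(max_steps):
        if x == 1:
            return True
        x = f(x)
    return x == 1


# Collect any starting values that do not reach 1 within the cap
failures = [i for i in range(1, 10001) if not reaches_one(i)]
print(f'starting values that failed: {failures}')
```

An empty `failures` list means every starting value from 1 to 10,000 reached 1 within the cap. A non-empty list would only show that the cap was exceeded, not that the conjecture is false.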


## Task 2

> Give an overview of the famous penguins data set, explaining 
the types of variables it contains. Suggest the types of variables
that should be used to model them in Python, explaining your
rationale.

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")

In [7]:
df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
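
As a starting point for the overview, the column types can be inspected directly. The sketch below rebuilds the first five rows shown in the output above, so that it runs without a network connection, and then converts the three text columns to the pandas `category` dtype. Treating `species`, `island` and `sex` as categorical is an assumption made here for illustration; the names `sample` and `typed` are hypothetical and not part of the original notebook.

```python
import io

import pandas as pd

# First five rows copied from the output above (row 3 has missing measurements)
sample_csv = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Types pandas infers on its own
print(sample.dtypes)

# Assumption: the text columns take a small fixed set of values,
# so they are modelled here as categorical variables
typed = sample.astype(
    {"species": "category", "island": "category", "sex": "category"}
)
print(typed.dtypes)

# Count missing values per column
print(typed.isna().sum())
```

The same `astype` call could be applied to the full `df` loaded in In [6]. The four measurement columns are read as floating-point numbers, which also lets pandas represent the missing entries as `NaN`.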


## Task 3

> For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list
is the most appropriate to model the variable.

## Task 4

> Suppose you are flipping two coins, each with a probability $p$ of giving heads. Plot the entropy of the total number of heads versus $p$.
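
One possible way to approach this task is sketched below. It assumes the two flips are independent, so the number of heads is 0, 1 or 2 with probabilities $(1-p)^2$, $2p(1-p)$ and $p^2$, and it measures entropy in bits (base-2 logarithm). The function names `heads_entropy` and `plot_heads_entropy` are hypothetical, and the plotting helper assumes matplotlib is installed.

```python
import math


def heads_entropy(p):
    """Entropy in bits of the number of heads in two independent flips.

    Assumes each coin lands heads with probability p.
    """
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Outcomes with probability 0 contribute nothing to the entropy
    return -sum(q * math.log2(q) for q in probs if q > 0)


def plot_heads_entropy(n_points=201):
    """Plot heads_entropy(p) for n_points evenly spaced values of p in [0, 1]."""
    # Imported here so heads_entropy can be used without matplotlib
    import matplotlib.pyplot as plt

    ps = [i / (n_points - 1) for i in range(n_points)]
    plt.plot(ps, [heads_entropy(p) for p in ps])
    plt.xlabel("p")
    plt.ylabel("entropy of number of heads (bits)")
    plt.show()


print(heads_entropy(0.5))
# plot_heads_entropy()  # uncomment to draw the curve
```

Under these assumptions the entropy is zero at $p = 0$ and $p = 1$, where the outcome is certain, and symmetric about $p = 0.5$.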

## Task 5

> Create an appropriate individual plot for each of the variables in the penguin data set.

***

## End