# Fundamentals of Data Analysis Tasks

---

**Author: Damien Farrell**

---

## Task One

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence $1, 4, 2, 1, 4, 2, ...$
> $$
> f(n) = 
> \begin{cases} 
> n \div 2 & \text{if } n \text{ is even} \\
> 3n + 1 & \text{if } n \text{ is odd}
> \end{cases}
> $$
> Your task is to verify, using Python, that the conjecture is true for the first 10,000 positive integers.

**Function f(x) to carry out the calculation.**

In [1]:
from tqdm.notebook import tqdm

In [2]:
def f(x):
    if x % 2 == 0:
        return x // 2
    else:
        return (3 * x) + 1

**Function collatz(x) to loop through the calculation until $x = 1$.**

In [3]:
def collatz(x, verbose=False):
    while x != 1:
        if verbose:
            print(x, end=', ')
        x = f(x)
    if verbose:
        print(x, end='\n')
        print()
    return x
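To make the loop concrete, the sketch below collects the values visited instead of printing them. The helper name `collatz_sequence` is hypothetical and not part of the original notebook; it applies the same even/odd rule as `f(x)`.

```python
def collatz_sequence(x):
    """Return the values visited from x until 1 is reached (hypothetical helper)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:
        # Same rule as f(x): halve even numbers, map odd n to 3n + 1.
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Returning a list rather than printing makes the sequence easy to inspect or test, at the cost of holding every value in memory.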

**Function verify(y) to verify the Collatz conjecture for the first $y$ positive integers.**

In [4]:
def verify(y, verbose=False):
    # Run every integer from 1 to y through collatz() and confirm each one ends at 1.
    for i in range(1, y + 1):
        result = collatz(i, verbose)
        if result != 1:
            print(f'The Collatz conjecture is false, it failed on integer {i}')
            return

    print(f'The Collatz conjecture is verified to be true for the first {y:,} positive integers')

<p>To show the calculation on each iteration, set <code>verbose=True</code>.<sup id="fnref:1"><a href="#fn:1">1</a></sup></p>

In [5]:
verify(10_000, verbose=False)

The Collatz conjecture is verified to be true for the first 10,000 positive integers
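One caveat: `collatz` only returns once `x` equals 1, so the failure branch in `verify` can never run; a true counterexample would loop forever. A possible way to make failure observable is to cap the number of steps. The sketch below is an illustration only: the names `reaches_one` and `verify_bounded` and the default limit of 1,000 steps are assumptions, not part of the original notebook.

```python
def reaches_one(x, max_steps=1_000):
    """Return True if x reaches 1 within max_steps applications of the Collatz rule."""
    for _ in range(max_steps):
        if x == 1:
            return True
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    return x == 1


def verify_bounded(y, max_steps=1_000):
    """Return the first integer in 1..y that fails to reach 1 in time, or None."""
    for i in range(1, y + 1):
        if not reaches_one(i, max_steps):
            return i
    return None


print(verify_bounded(10_000))  # None
```

A result of `None` means every integer up to `y` reached 1 within the step limit. A returned integer only means the limit was too small for that value or that it deserves a closer look; it is not proof of a counterexample.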


---

## Task Two

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

In [6]:
import pandas as pd
import numpy as np

url_name = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'

df = pd.read_csv(url_name)

In [7]:
df.head()

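|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |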


In [8]:
df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

<p>To determine the unique values of the categorical variables:<sup id="fnref:2"><a href="#fn:2">2</a></sup></p>

In [10]:
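# Names of the columns that hold categorical data
categorical_variables = ["species", "island", "sex"]

# Print the distinct values found in each categorical column
for variable in categorical_variables:
    print(df[variable].unique())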

['Adelie' 'Chinstrap' 'Gentoo']
['Torgersen' 'Biscoe' 'Dream']
['MALE' 'FEMALE' nan]
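The `nan` in the last list shows that `sex` has missing entries. To see how many, `value_counts(dropna=False)` counts missing values alongside the real categories. The sketch below uses a small hand-built Series, copied from the `sex` column of the five rows shown by `df.head()`, so that it runs without downloading the data; the variable name `sex_sample` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# The sex values from the first five rows of the dataset shown above
sex_sample = pd.Series(["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"], name="sex")

# dropna=False keeps missing values as their own row in the tally
counts = sex_sample.value_counts(dropna=False)
print(counts)

# Number of missing entries
print(sex_sample.isna().sum())  # 1
```

On the full dataframe the same calls would be `df["sex"].value_counts(dropna=False)` and `df["sex"].isna().sum()`.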


### To Summarise:
**Species (Categorical Variable):** <br>
This categorical variable records the penguin's species and takes one of three values: Adelie, Chinstrap, or Gentoo. In the dataframe it has the `object` dtype, a catch-all pandas type that here holds strings.

**Island (Categorical Variable):** <br>
This categorical variable records the island where each penguin was observed: Biscoe, Dream, or Torgersen. Like species, it has the `object` dtype and holds strings.

**Bill Length (Numerical Variable):** <br>
This variable is the length of the penguin's bill (culmen) in millimetres. It is a continuous numerical variable stored as `float64`.

**Bill Depth (Numerical Variable):** <br>
This variable is the depth of the penguin's bill in millimetres. Like bill length, it is a continuous numerical variable stored as `float64`.

**Flipper Length (Numerical Variable):** <br>
This variable is the length of the penguin's flipper in millimetres. It is a numerical variable stored as `float64`.

**Body Mass (Numerical Variable):** <br>
This variable is the penguin's body mass in grams. It is a numerical variable stored as `float64`.

**Sex (Categorical Variable):** <br>
This categorical variable records the sex of each penguin and has two categories, MALE and FEMALE. Some entries are missing and appear as NaN, which marks missing data rather than a third category. In the dataframe it has the `object` dtype and holds strings alongside the missing values.
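Since species, island, and sex each take only a few fixed values, one reasonable way to model them in pandas is the `category` dtype, with the measurements left as `float64` so that missing values can still be stored as NaN. The sketch below is one option rather than the only one. It rebuilds the five rows shown by `df.head()` by hand so that it runs without downloading the data, and the names `sample`, `categorical_columns`, and `typed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# The first five rows of the penguins data, as shown by df.head() above
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE"],
})

# Convert only the categorical columns; numeric columns keep float64
categorical_columns = ["species", "island", "sex"]
typed = sample.astype({name: "category" for name in categorical_columns})

print(typed.dtypes)

# Missing values stay missing; they do not become a category
print(typed["sex"].cat.categories.tolist())  # ['FEMALE', 'MALE']
```

On the full dataframe the equivalent call would be `df.astype({name: "category" for name in categorical_columns})`.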


---
## Task Three



# References
---

<ol>
  <li>
    <a href="https://stackoverflow.com/questions/5980042/how-to-implement-the-verbose-or-v-option-into-a-script" id="fn:1">How to implement the --verbose or -v option into a script?</a>
  </li>
  <li>
    <a href="https://www.statology.org/pandas-unique-values-in-column/" id="fn:2">Pandas: How to Find Unique Values in a Column</a>
  </li>
</ol>



***
# End