# Fundamentals of Data Analysis Tasks

**Conor Tierney**

***

# Task 1 - Collatz Conjecture

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...
> The task is to verify, using Python, that the conjecture is true for the first 10,000 positive integers.
> At each stage of the loop, the program updates the value based on whether it is even or odd: if $x$ is even, divide it by 2; if $x$ is odd, multiply it by 3 and add 1. When the value reaches 1, the program ends.


In [1]:
def f(x):
    # If x is even, divide it by two (integer division keeps the result an int).
    if x % 2 == 0:
        return x // 2
    # If x is odd, multiply it by 3 and add 1.
    else:
        return (3 * x) + 1

In [2]:
def collatz(x):
    # Keep looping for as long as x has not reached 1.
    while x != 1:
        print(x, end=', ')    # print the current value, then move to the next one
        x = f(x)
    print(x)                  # print the final 1
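
Because `collatz` prints as it goes, its result cannot easily be checked by other code. A minimal sketch of a variant that returns the sequence as a list instead; the name `collatz_sequence` is hypothetical and not part of the original notebook:

```python
def f(x):
    """Apply one Collatz step to a positive integer x."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x):
    """Return the Collatz sequence from x down to 1 as a list (hypothetical helper)."""
    sequence = [x]
    while x != 1:
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Returning a list keeps the printing separate from the calculation, so the same function can be reused for checks or plots.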

> Verify that the conjecture is true for the first 10,000 positive integers.

In [3]:
collatz(10000)

10000, 5000, 2500, 1250, 625, 1876, 938, 469, 1408, 704, 352, 176, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1
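
The call above traces a single starting value, 10,000. To cover every starting value from 1 to 10,000, one possible approach is to loop over the whole range and confirm that each value reaches 1. The sketch below is an assumption about how this could be done: `reaches_one` and the `max_steps` safety cap are hypothetical names, not part of the original notebook.

```python
def f(x):
    """Apply one Collatz step to a positive integer x."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def reaches_one(x, max_steps=10_000):
    """Return True if x reaches 1 within max_steps steps (hypothetical helper).

    The cap is only a safeguard so the loop cannot run forever.
    """
    steps = 0
    while x != 1:
        if steps >= max_steps:
            return False
        x = f(x)
        steps += 1
    return True


# Collect every starting value in 1..10,000 that fails to reach 1.
failures = [n for n in range(1, 10_001) if not reaches_one(n)]

if failures:
    print("Did not reach 1:", failures)
else:
    print("Every integer from 1 to 10,000 reaches 1.")
```

An empty `failures` list means the conjecture holds for each of the first 10,000 positive integers.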




## End -  Task 1

***
***

# Task 2 - Penguins Dataset Overview

The well-known Palmer penguins dataset is commonly used in data science and analytics.
It comprises information on penguins inhabiting the Palmer Archipelago, Antarctica. Its attributes are the island (Torgersen, Biscoe, or Dream), the species (Adelie, Chinstrap, or Gentoo), bill length (mm), bill depth (mm), flipper length (mm), body mass (g), and the sex of each penguin.
The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER between 2007 and 2009.
In total, 344 samples were collected; 2 of them have missing structural size measurements.

![Penguins Dataset](https://imgur.com/orZWHly.png)

## Types of Variables in the Dataset

The dataset contains categorical and numeric variables. Of its 7 columns, 3 are categorical (species, island and sex) and 4 are numeric (bill length, bill depth, flipper length and body mass).

### **Categorical variables**:

1. *Species: This represents the penguin species and contains 3 categories.*
- Adelie.
- Chinstrap.
- Gentoo.   
  
2. *Island: Where the penguins were observed.*
- Torgersen.
- Biscoe.
- Dream.   
  
3. *Sex: Gender of the penguins.*
- Male.
- Female.

### **Numeric variables**:

1. Bill Length (mm).
2. Bill Depth (mm).
3. Flipper Length (mm).
4. Body Mass (g).
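
The split into categorical and numeric variables can also be checked programmatically. A minimal sketch, assuming a small hand-built `sample` DataFrame that copies the first five rows of the dataset (the names `sample`, `categorical_cols` and `numeric_cols` are illustrative only):

```python
import pandas as pd

nan = float("nan")

# First five rows of the penguins data, typed in by hand so the
# example runs without downloading the CSV file.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
})

# Columns holding numbers versus everything else (text categories here).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
```

The same two `select_dtypes` calls should work unchanged on the full DataFrame once it has been loaded.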


### Investigation of the Dataset using Pandas

In [2]:
import pandas as pd

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
# read in the csv file

In [12]:
print(df)
# Print the DataFrame; pandas abbreviates long output with ".." rows.

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    MALE  
1         3800.0  FEMALE  
2     

In [14]:
df.head()
# Verify structure

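|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |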
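
Row 3 above has no measurements at all, which is one of the samples with missing data. A minimal sketch of how such gaps could be counted with `isna()`, assuming a hand-built `sample` DataFrame that copies these five rows (the variable names are illustrative only):

```python
import pandas as pd

nan = float("nan")

# The five rows shown by df.head(), typed in by hand.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
})

measure_cols = ["bill_length_mm", "bill_depth_mm",
                "flipper_length_mm", "body_mass_g"]

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows where every measurement is missing.
rows_without_measurements = int(sample[measure_cols].isna().all(axis=1).sum())

# Rows with no missing values at all.
complete_rows = sample.dropna()

print(missing_per_column)
print("Rows without measurements:", rows_without_measurements)
print("Complete rows:", len(complete_rows))
```

Run against the full DataFrame, the same calls give a quick check on the number of samples with missing measurements.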


In [8]:
df.describe()
# Summary statistics for the numeric columns of the DataFrame.
# A single column can be summarised the same way, e.g. df["body_mass_g"].describe().

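|       | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|-------|----------------|---------------|-------------------|-------------|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |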


In [11]:
df.mean(numeric_only=True)
# Means of the 4 numeric variables; numeric_only=True skips the text columns.


bill_length_mm         43.921930
bill_depth_mm          17.151170
flipper_length_mm     200.915205
body_mass_g          4201.754386
dtype: float64
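
The four means above are calculated from the 342 rows that have measurements, because pandas skips missing values by default. Depending on the pandas version, calling `mean()` on a DataFrame that also holds text columns either warns or raises an error unless `numeric_only=True` is passed. A minimal sketch of both points, assuming a hand-built `sample` DataFrame with the first five rows of the dataset (names are illustrative only):

```python
import pandas as pd

nan = float("nan")

# First five rows of the penguins data, typed in by hand.
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, nan, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, nan, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, nan, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, nan, 3450.0],
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
})

# numeric_only=True restricts the calculation to the numeric columns.
# Missing values (NaN) are skipped, so each mean uses 4 of the 5 rows.
means = sample.mean(numeric_only=True)
print(means)

# The same result for one column, computed by hand for comparison.
manual_mean = (3750.0 + 3800.0 + 3250.0 + 3450.0) / 4
print(manual_mean)  # 3562.5
```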

### References:
- https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95
- https://inria.github.io/scikit-learn-mooc/python_scripts/trees_dataset.html
- https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris (for penguin image)
- Wikipedia: penguin dataset

## End -  Task 2

***
***

# Task 3