# **Title**

## **Data Collection**

## Objectives

* Fetch the data from Kaggle
* Prepare the data for further processing

## Inputs

* Kaggle JSON file - the authentication token

## Outputs

* Generate Dataset: inputs/dataset/fossils

---

## Install the required libraries

In [1]:
%pip install kaggle pandas

Note: you may need to restart the kernel to use updated packages.


## Import Libraries

In [2]:
import os
import pandas as pd

## Change working directory

* To save the data in a subfolder that is separate from the notebooks, we need to change the working directory from the current folder to its parent folder.

* We access the current directory with os.getcwd()

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Victo\\IBM-machine-learning-certification\\EDA\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory:

* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Victo\\IBM-machine-learning-certification\\EDA'
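
Running the os.chdir() cell a second time moves up one more level, which is easy to do by accident in a notebook. The following is a minimal sketch of a guard against that, not part of the original notebook. The helper name `move_to_project_root` is hypothetical, and it assumes the notebooks live in a folder named `jupyter_notebooks`, as in the path printed above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only while still inside the notebooks folder.

    `notebook_dir_name` is an assumption taken from the path shown above.
    Returns the current working directory after the (possible) change.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Equivalent to os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory unchanged after the first successful move, so the cell can be re-run safely.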

## Reading and understanding our data
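
The download step relies on the Kaggle JSON token listed under Inputs. As a hedged sketch (the notebook does not show how the token is configured), one option is to point the Kaggle client at the folder that holds `kaggle.json` through the `KAGGLE_CONFIG_DIR` environment variable. The helper name `use_kaggle_token` is hypothetical, and where the token file lives is an assumption you need to adapt.

```python
import os
from pathlib import Path


def use_kaggle_token(config_dir):
    """Point the Kaggle client at a folder that contains kaggle.json.

    Returns the path to the token file and raises FileNotFoundError
    if the file is missing.
    """
    config_dir = Path(config_dir).resolve()
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # The Kaggle client looks for kaggle.json in this folder when the variable is set
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

For example, `use_kaggle_token(os.getcwd())` would apply if the token had been copied to the project root. Shell commands started from the notebook inherit the kernel's environment, so the setting should carry over to the download command.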

In [7]:
!kaggle datasets download -d "stealthtechnologies/predict-the-age-of-a-fossil"

Dataset URL: https://www.kaggle.com/datasets/stealthtechnologies/predict-the-age-of-a-fossil
License(s): MIT
Downloading predict-the-age-of-a-fossil.zip to c:\Users\Victo\IBM-machine-learning-certification\EDA




  0%|          | 0.00/302k [00:00<?, ?B/s]
100%|██████████| 302k/302k [00:00<00:00, 1.48MB/s]


In [8]:
import zipfile
with zipfile.ZipFile("predict-the-age-of-a-fossil.zip", 'r') as zip_ref:
    zip_ref.extractall("data")
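
Before reading the extracted files, it can help to check which CSV files the archive actually contains. This is a small sketch using the standard `zipfile` module. The helper name `list_csv_members` is hypothetical and not part of the original notebook.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() returns every member path stored in the archive
        return [name for name in zip_ref.namelist() if name.lower().endswith(".csv")]
```

Calling `list_csv_members("predict-the-age-of-a-fossil.zip")` before the archive is deleted would list the file names to pass to `pd.read_csv`.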

In [9]:
if os.path.exists("predict-the-age-of-a-fossil.zip"):
    os.remove("predict-the-age-of-a-fossil.zip")
else:
    print("The file does not exist")
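
The same cleanup can be sketched with `pathlib`, which avoids the explicit existence check. This is an alternative under the assumption of Python 3.8 or later (needed for `missing_ok`). The helper name `remove_if_present` is hypothetical.

```python
from pathlib import Path


def remove_if_present(path):
    """Delete a file if it exists and report whether it was there."""
    path = Path(path)
    existed = path.exists()
    # missing_ok=True suppresses FileNotFoundError when the file is already gone
    path.unlink(missing_ok=True)
    return existed
```

The return value lets the caller decide whether to print a message such as "The file does not exist".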

In [10]:
df1 = pd.read_csv("data/test_data.csv")
df1.head()

Unnamed: 0,uranium_lead_ratio,carbon_14_ratio,radioactive_decay_series,stratigraphic_layer_depth,geological_period,paleomagnetic_data,inclusion_of_other_fossils,isotopic_composition,surrounding_rock_type,stratigraphic_position,fossil_size,fossil_weight,age
0,0.469986,1.0,0.667595,29.58,Triassic,Normal polarity,False,0.58356,Limestone,Bottom,120.12,73.83,41072
1,0.619865,0.474208,1.218381,69.87,Cretaceous,Reversed polarity,True,0.942719,Shale,Middle,72.82,191.68,42085
2,0.767736,0.478731,0.119801,96.38,Cretaceous,Normal polarity,False,0.377531,Sandstone,Bottom,105.47,82.25,50436
3,0.275121,0.400594,0.63476,134.1,Triassic,Normal polarity,True,0.32382,Sandstone,Middle,94.99,47.99,25923
4,0.40747,0.039705,0.824597,124.1,Triassic,Normal polarity,False,1.21912,Shale,Middle,139.93,532.62,30272


In [11]:
df2 = pd.read_csv("data/train_data.csv")
df2.head()

Unnamed: 0,uranium_lead_ratio,carbon_14_ratio,radioactive_decay_series,stratigraphic_layer_depth,geological_period,paleomagnetic_data,inclusion_of_other_fossils,isotopic_composition,surrounding_rock_type,stratigraphic_position,fossil_size,fossil_weight,age
0,0.738061,0.487707,0.907884,91.17,Cretaceous,Normal polarity,False,0.915951,Conglomerate,Middle,50.65,432.0,43523
1,0.560096,0.341738,1.121302,165.44,Cambrian,Normal polarity,False,0.803968,Limestone,Top,48.85,353.29,44112
2,0.424773,0.218493,0.103855,218.98,Cambrian,Normal polarity,True,0.792441,Shale,Bottom,37.66,371.33,43480
3,0.349958,0.704649,0.383617,51.09,Permian,Normal polarity,True,0.074636,Limestone,Bottom,39.1,232.84,30228
4,0.886811,0.777494,0.593254,313.72,Devonian,Normal polarity,True,1.64664,Shale,Top,90.84,277.67,67217


In [12]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   uranium_lead_ratio          1100 non-null   float64
 1   carbon_14_ratio             1100 non-null   float64
 2   radioactive_decay_series    1100 non-null   float64
 3   stratigraphic_layer_depth   1100 non-null   float64
 4   geological_period           1100 non-null   object 
 5   paleomagnetic_data          1100 non-null   object 
 6   inclusion_of_other_fossils  1100 non-null   bool   
 7   isotopic_composition        1100 non-null   float64
 8   surrounding_rock_type       1100 non-null   object 
 9   stratigraphic_position      1100 non-null   object 
 10  fossil_size                 1100 non-null   float64
 11  fossil_weight               1100 non-null   float64
 12  age                         1100 non-null   int64  
dtypes: bool(1), float64(7), int64(1),

In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4398 entries, 0 to 4397
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   uranium_lead_ratio          4398 non-null   float64
 1   carbon_14_ratio             4398 non-null   float64
 2   radioactive_decay_series    4398 non-null   float64
 3   stratigraphic_layer_depth   4398 non-null   float64
 4   geological_period           4398 non-null   object 
 5   paleomagnetic_data          4398 non-null   object 
 6   inclusion_of_other_fossils  4398 non-null   bool   
 7   isotopic_composition        4398 non-null   float64
 8   surrounding_rock_type       4398 non-null   object 
 9   stratigraphic_position      4398 non-null   object 
 10  fossil_size                 4398 non-null   float64
 11  fossil_weight               4398 non-null   float64
 12  age                         4398 non-null   int64  
dtypes: bool(1), float64(7), int64(1),

In [14]:
df1.columns.values.tolist() == df2.columns.values.tolist()

True
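
Matching column names do not guarantee matching dtypes, and a dtype mismatch can cause silent type changes when the frames are concatenated. The sketch below extends the check to dtypes as well. The helper name `same_schema` is hypothetical, and the idea is an addition for illustration rather than a step from the original notebook.

```python
import pandas as pd


def same_schema(left, right):
    """True when both frames share column names, column order, and dtypes."""
    same_columns = list(left.columns) == list(right.columns)
    same_dtypes = list(left.dtypes) == list(right.dtypes)
    return same_columns and same_dtypes
```

With the frames loaded above, `same_schema(df1, df2)` would be the call to make before concatenating.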

In [15]:
frames = [df1, df2]
fossils = pd.concat(frames, ignore_index=True)
# drop_duplicates() returns a new DataFrame, so assign the result back
fossils = fossils.drop_duplicates()
fossils

Unnamed: 0,uranium_lead_ratio,carbon_14_ratio,radioactive_decay_series,stratigraphic_layer_depth,geological_period,paleomagnetic_data,inclusion_of_other_fossils,isotopic_composition,surrounding_rock_type,stratigraphic_position,fossil_size,fossil_weight,age
0,0.469986,1.000000,0.667595,29.58,Triassic,Normal polarity,False,0.583560,Limestone,Bottom,120.12,73.83,41072
1,0.619865,0.474208,1.218381,69.87,Cretaceous,Reversed polarity,True,0.942719,Shale,Middle,72.82,191.68,42085
2,0.767736,0.478731,0.119801,96.38,Cretaceous,Normal polarity,False,0.377531,Sandstone,Bottom,105.47,82.25,50436
3,0.275121,0.400594,0.634760,134.10,Triassic,Normal polarity,True,0.323820,Sandstone,Middle,94.99,47.99,25923
4,0.407470,0.039705,0.824597,124.10,Triassic,Normal polarity,False,1.219120,Shale,Middle,139.93,532.62,30272
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5493,0.049660,0.601750,0.762490,222.54,Jurassic,Reversed polarity,True,2.247495,Sandstone,Bottom,91.69,415.13,26606
5494,0.360085,0.215033,1.002406,276.70,Cretaceous,Reversed polarity,True,1.004584,Conglomerate,Bottom,68.97,121.10,44850
5495,0.464864,0.553313,0.659639,76.77,Devonian,Normal polarity,True,0.721947,Conglomerate,Middle,11.37,288.73,32186
5496,0.803338,0.272392,0.123562,204.82,Neogene,Reversed polarity,True,1.496427,Sandstone,Bottom,132.34,518.31,59888
