# Lab 1.02 - Android Persistence

Import all necessary Python libraries and create a variable `android_persistence` to load the dataset [android_persistence_cpu.csv](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.csv). See the [code book](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.md) for more info on the contents. Note this file is not stored as a regular CSV file! You may need to tweak the parameters of the import function to load the file correctly.

In [32]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation

In [33]:
android_persistence = pd.read_csv('https://raw.githubusercontent.com/HoGentTIN/dsai-labs/refs/heads/main/data/android_persistence_cpu.csv', delimiter=';')
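The effect of the `delimiter` argument can be checked on a small in-memory sample. The sketch below is illustrative only: the three rows and their column order are made up, and the only assumption carried over from the real file is that fields are separated by semicolons.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a semicolon-separated layout
sample = (
    "Time;PersistenceType;DataSize\n"
    "1.81;Sharedpreferences;Small\n"
    "1.35;Sharedpreferences;Small\n"
    "10.2;Realm;Large\n"
)

# Default separator (comma): each line is read as one single field
wrong = pd.read_csv(io.StringIO(sample))

# Explicit separator: the three fields are split into separate columns
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)   # one column only
print(right.shape)   # three columns
print(right.dtypes)  # Time is parsed as a float
```

In `read_csv()`, `delimiter` is an alias for `sep`, so both spellings behave the same way.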

Explore the dataset:

- How many variables and observations are present in the dataset?
- What is the level of measurement of each variable?
- Perform the conversion of the qualitative variables to the appropriate type (and specify the order of ordinal variables).
- List the data types in the dataset.
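A minimal sketch of how the first and last question could be answered, using a hypothetical four-row stand-in (`df`) rather than the real dataset:

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
df = pd.DataFrame({
    "Time": [1.81, 6.2, 10.7, 12.9],
    "PersistenceType": ["Sharedpreferences", "Realm", "GreenDAO", "SQLLite"],
    "DataSize": ["Small", "Medium", "Large", "Large"],
})

# shape is a (rows, columns) tuple: rows are observations, columns are variables
n_obs, n_vars = df.shape
print(f"{n_obs} observations, {n_vars} variables")

# One dtype per column; text columns are not categorical until converted
print(df.dtypes)
```

The level of measurement cannot be read from the dtypes alone. It follows from what the variable means, which is why the code book is needed.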

In [41]:
android_persistence.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Time             300 non-null    float64 
 1   PersistenceType  300 non-null    category
 2   DataSize         300 non-null    category
dtypes: category(2), float64(1)
memory usage: 3.4 KB


In [35]:
android_persistence.head(10)

|   | Time | PersistenceType | DataSize |
|---|------|-----------------|----------|
| 0 | 1.81 | Sharedpreferences | Small |
| 1 | 1.35 | Sharedpreferences | Small |
| 2 | 1.84 | Sharedpreferences | Small |
| 3 | 1.54 | Sharedpreferences | Small |
| 4 | 1.81 | Sharedpreferences | Small |
| 5 | 1.82 | Sharedpreferences | Small |
| 6 | 1.79 | Sharedpreferences | Small |
| 7 | 1.57 | Sharedpreferences | Small |
| 8 | 1.78 | Sharedpreferences | Small |
| 9 | 1.79 | Sharedpreferences | Small |


In [36]:
android_persistence.Time.describe()

count    300.000000
mean       6.230833
std        4.229599
min        1.090000
25%        1.790000
50%        6.185000
75%       10.662500
max       13.560000
Name: Time, dtype: float64

In [37]:
android_persistence.DataSize.value_counts()

DataSize
Small     120
Medium     90
Large      90
Name: count, dtype: int64

- `Time` is a ratio variable
- `PersistenceType` is a nominal variable
- `DataSize` is an ordinal variable

In [38]:
android_persistence['PersistenceType'] = android_persistence["PersistenceType"].astype('category')

In [39]:
# Categories must be listed from lowest to highest: Small < Medium < Large
datasize_type = CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
android_persistence['DataSize'] = android_persistence['DataSize'].astype(datasize_type)
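The order in which the categories are passed to `CategoricalDtype` is the order pandas uses for sorting and comparing, from lowest to highest. A small sketch on a hypothetical series (`sizes`), assuming the intended order is Small < Medium < Large:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories listed from lowest to highest
size_type = CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)

# Hypothetical values, not taken from the dataset
sizes = pd.Series(["Medium", "Large", "Small", "Medium"]).astype(size_type)

print(sizes.min(), sizes.max())      # lowest and highest category present
print(sizes.sort_values().tolist())  # sorted by category order, not alphabetically
print((sizes > "Small").tolist())    # comparisons follow the declared order
```

With `ordered=False` (the default), `min()`, `max()` and the `>` comparison raise a `TypeError`, which is a quick way to check that a variable was really stored as ordinal.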

Describe each variable.

In [50]:
print("Time:")
print(android_persistence['Time'].describe())
print("\nDataSize:")
print(android_persistence['DataSize'].describe())
print("\nPersistenceType:")
print(android_persistence['PersistenceType'].describe())

Time:
count    300.000000
mean       6.230833
std        4.229599
min        1.090000
25%        1.790000
50%        6.185000
75%       10.662500
max       13.560000
Name: Time, dtype: float64

DataSize:
count       300
unique        3
top       Small
freq        120
Name: DataSize, dtype: object

PersistenceType:
count          300
unique           4
top       GreenDAO
freq            90
Name: PersistenceType, dtype: object


What unique values are there for the variables `PersistenceType` and `DataSize`? How often does each value occur?

In [48]:
print(android_persistence.PersistenceType.value_counts())
print("\n")
print(android_persistence.DataSize.value_counts())

PersistenceType
GreenDAO             90
Realm                90
SQLLite              90
Sharedpreferences    30
Name: count, dtype: int64


DataSize
Small     120
Large      90
Medium     90
Name: count, dtype: int64
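Relative frequencies are sometimes easier to compare than raw counts. A minimal sketch, using a hypothetical four-value series instead of the real column:

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
sizes = pd.Series(["Small", "Small", "Medium", "Large"])

counts = sizes.value_counts()                # absolute frequencies
shares = sizes.value_counts(normalize=True)  # relative frequencies, summing to 1

print(sizes.unique())  # the distinct values, in order of first appearance
print(counts)
print(shares)
```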


In this dataset, it is especially interesting to know how often each unique combination of `PersistenceType` and `DataSize` occurs. Figure out how to use the Pandas function `crosstab()` to create a so-called contingency table for these variables. By the way, this concept will return in Module 4 (examining the relationship between 2 qualitative variables).

In [53]:
pd.crosstab(android_persistence.PersistenceType, android_persistence.DataSize)

| PersistenceType \ DataSize | Large | Medium | Small |
|----------------------------|-------|--------|-------|
| GreenDAO | 30 | 30 | 30 |
| Realm | 30 | 30 | 30 |
| SQLLite | 30 | 30 | 30 |
| Sharedpreferences | 0 | 0 | 30 |
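Row and column totals can be added to a contingency table with the `margins` parameter of `crosstab()`. The sketch below uses a hypothetical four-row frame (`df`), so the counts are illustrative only:

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
df = pd.DataFrame({
    "PersistenceType": ["Realm", "Realm", "GreenDAO", "Sharedpreferences"],
    "DataSize": ["Small", "Large", "Small", "Small"],
})

# margins=True appends an "All" row and an "All" column holding the totals
table = pd.crosstab(df["PersistenceType"], df["DataSize"], margins=True)
print(table)
```

Combinations that never occur show up as 0, which is how the empty `Sharedpreferences` cells in the table above arise.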
