# Ex3 - Getting and Knowing your Data

Check out the [Occupation Exercises Video Tutorial](https://www.youtube.com/watch?v=W8AB5s-L3Rw&list=PLgJhDSE2ZLxaY_DigHeiIDC1cD09rXgJv&index=4) to watch a data scientist go through the exercises.

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import polars as pl
import time

start_time = time.time()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

Polars DataFrames have no index, so `user_id` is kept as a regular column.

In [2]:
dtypes = {'gender': pl.Categorical, 'occupation': pl.Categorical}
users = pl.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      separator='|', dtypes=dtypes)

### Step 4. See the first 25 entries

In [3]:
users.head(25)

user_id,age,gender,occupation,zip_code
i64,i64,cat,cat,str
1,24,"""M""","""technician""","""85711"""
2,53,"""F""","""other""","""94043"""
3,23,"""M""","""writer""","""32067"""
4,24,"""M""","""technician""","""43537"""
5,33,"""F""","""other""","""15213"""
6,42,"""M""","""executive""","""98101"""
7,57,"""M""","""administrator""","""91344"""
8,36,"""M""","""administrator""","""05201"""
9,29,"""M""","""student""","""01002"""
10,53,"""M""","""lawyer""","""90703"""


### Step 5. See the last 10 entries

In [4]:
users.tail(10)

user_id,age,gender,occupation,zip_code
i64,i64,cat,cat,str
934,61,"""M""","""engineer""","""22902"""
935,42,"""M""","""doctor""","""66221"""
936,24,"""M""","""other""","""32789"""
937,48,"""M""","""educator""","""98072"""
938,38,"""F""","""technician""","""55038"""
939,26,"""F""","""student""","""33319"""
940,32,"""M""","""administrator""","""02215"""
941,20,"""M""","""student""","""97229"""
942,48,"""F""","""librarian""","""78209"""
943,22,"""M""","""student""","""77841"""


### Step 6. What is the number of observations in the dataset?

In [5]:
users.height

943

### Step 7. What is the number of columns in the dataset?

In [6]:
users.width

5

### Step 8. Print the name of all the columns.

In [7]:
users.columns

['user_id', 'age', 'gender', 'occupation', 'zip_code']

### Step 10. What is the data type of each column?

In [8]:
users.schema

{'user_id': Int64,
 'age': Int64,
 'gender': Categorical,
 'occupation': Categorical,
 'zip_code': Utf8}

### Step 11. Print only the occupation column

In [9]:
users.select('occupation')

# or, to get a Series instead of a one-column DataFrame

users['occupation']

occupation
cat
"""technician"""
"""other"""
"""writer"""
"""technician"""
"""other"""
"""executive"""
"""administrator"""
"""administrator"""
"""student"""
"""lawyer"""


### Step 12. How many different occupations are in this dataset?

In [10]:
users['occupation'].n_unique()
# or count the rows of value_counts(), which has one row per unique value
# users['occupation'].value_counts().height

21

### Step 13. What is the most frequent occupation?

In [11]:
users['occupation'].value_counts().top_k(5, by='counts')

occupation,counts
cat,u32
"""student""",196
"""other""",105
"""educator""",95
"""administrator""",79
"""engineer""",67


### Step 14. Summarize the DataFrame.

In [12]:
users.describe()

describe,user_id,age,gender,occupation,zip_code
str,f64,f64,str,str,str
"""count""",943.0,943.0,"""943""","""943""","""943"""
"""null_count""",0.0,0.0,"""0""","""0""","""0"""
"""mean""",472.0,34.051962,,,
"""std""",272.364951,12.19274,,,
"""min""",1.0,7.0,,,"""00000"""
"""max""",943.0,73.0,,,"""Y1A6B"""
"""median""",472.0,31.0,,,
"""25%""",236.0,25.0,,,
"""75%""",708.0,43.0,,,


### Step 15. Summarize only the numeric columns

In [13]:
users.select(pl.col(pl.INTEGER_DTYPES)).describe()

describe,user_id,age
str,f64,f64
"""count""",943.0,943.0
"""null_count""",0.0,0.0
"""mean""",472.0,34.051962
"""std""",272.364951,12.19274
"""min""",1.0,7.0
"""max""",943.0,73.0
"""median""",472.0,31.0
"""25%""",236.0,25.0
"""75%""",708.0,43.0


### Step 16. Summarize only the occupation column

In [14]:
users['occupation'].cast(pl.Utf8).describe()

statistic,value
str,i64
"""count""",943
"""null_count""",0
"""unique""",21


### Step 17. What is the mean age of users?

In [15]:
round(users['age'].mean())

34

### Step 18. What is the age with least occurrence?

In [16]:
users['age'].value_counts().bottom_k(5, by='counts')  # ages 7, 10, 11, 66 and 73 each occur only once

age,counts
i64,u32
73,1
11,1
66,1
7,1
10,1


In [17]:
print(f"Time elapsed: {time.time() - start_time}")

Time elapsed: 0.1845848560333252
