# SPI Project

## Checkpoint 1 for 09.03.2023:

Choose a dataset and perform a basic data analysis in Python.

Selected dataset: **Finance & Accounting Courses - Udemy (13K+ course)**

Dataset link: https://www.kaggle.com/datasets/jilkothari/finance-accounting-courses-udemy-13k-course/discussion

Basic data analysis:

- load the CSV into a DataFrame (pandas)
- preview the first 5 rows
- dataset size
- column names
- number of missing values per column (`.isna()`)
- unique values (`.unique()`)
- data types (`.dtypes`)
- value frequencies per column (loop over `data[column].value_counts()`)

Questions:

1. Is the dataset large enough?
2. Does the dataset contain sufficiently diverse data?
3. Does the dataset have a time dimension?
4. Does the dataset contain both quantitative and qualitative data?
5. Does the dataset have many missing values?

Briefly present (5 min) the dataset, the analysis results, and the answers to the questions at the lab session on 09.03.

### Importing packages

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter

### Loading the dataset into a DataFrame

In [2]:
df = pd.read_csv('finance-accounting-courses-udemy.csv')

### Preview of the first 5 rows

In [3]:
df.head()

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__currency,discount_price__price_string,price_detail__amount,price_detail__currency,price_detail__price_string
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,INR,₹455,8640.0,INR,"₹8,640"


### Dataset size


In [4]:
df.shape

(13608, 20)

### Column names

In [5]:
cols = list(df.columns)
cols

['id',
 'title',
 'url',
 'is_paid',
 'num_subscribers',
 'avg_rating',
 'avg_rating_recent',
 'rating',
 'num_reviews',
 'is_wishlisted',
 'num_published_lectures',
 'num_published_practice_tests',
 'created',
 'published_time',
 'discount_price__amount',
 'discount_price__currency',
 'discount_price__price_string',
 'price_detail__amount',
 'price_detail__currency',
 'price_detail__price_string']

### Number of missing values per column

In [6]:
df.isnull().sum()

id                                 0
title                              0
url                                0
is_paid                            0
num_subscribers                    0
avg_rating                         0
avg_rating_recent                  0
rating                             0
num_reviews                        0
is_wishlisted                      0
num_published_lectures             0
num_published_practice_tests       0
created                            0
published_time                     0
discount_price__amount          1403
discount_price__currency        1403
discount_price__price_string    1403
price_detail__amount             497
price_detail__currency           497
price_detail__price_string       497
dtype: int64
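The per-column counts can be condensed into one overall share of missing cells. The sketch below assumes a small made-up frame (`sample`) in place of the real CSV; only the last two lines use figures taken from the outputs above.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Udemy data
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "is_paid": [True, True, False, True],
    "price_detail__amount": [8640.0, 1280.0, None, 8640.0],
    "discount_price__amount": [455.0, None, None, 455.0],
})

missing_cells = int(sample.isna().sum().sum())  # total number of NaN cells
total_cells = sample.size                       # rows * columns
missing_share = missing_cells / total_cells
print(f"{missing_cells} of {total_cells} cells are missing ({missing_share:.2%})")

# The same arithmetic with the notebook's own figures:
# three columns with 1403 nulls, three with 497, in a (13608, 20) frame
reported_missing = 3 * 1403 + 3 * 497
reported_share = reported_missing / (13608 * 20)
print(f"{reported_missing} missing cells, {reported_share:.2%} of the raw data")
```

On the real data, `df.isna().sum().sum() / df.size` would give the share directly.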

Null values in the columns `price_detail__amount`, `price_detail__currency`, and `price_detail__price_string` indicate that the course was free, while null values in `discount_price__amount`, `discount_price__currency`, and `discount_price__price_string` indicate that the course had no discount. They can therefore be filled with default values (`0`, `INR`, `₹0`).

There is one error, however: one course is marked as paid but has no price recorded. That row is therefore dropped.

In [7]:
# Paid courses that have no recorded price
paid_df = df[df['is_paid'] & df['price_detail__amount'].isnull()]
paid_df

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__currency,discount_price__price_string,price_detail__amount,price_detail__currency,price_detail__price_string
13607,2935720,Acabou a Previdência e agora? - Volume 03,/course/acabou-a-previdencia-e-agora-volume-03/,True,0,0.0,0.0,0.0,0,False,14,0,2020-03-30T19:10:58Z,2020-04-02T16:33:32Z,,,,,,


In [8]:
# Fill in default values
df['discount_price__amount']=df['discount_price__amount'].fillna(value=0.0)
df['discount_price__currency']=df['discount_price__currency'].fillna(value='INR')
df['discount_price__price_string']=df['discount_price__price_string'].fillna(value='₹0')
df['price_detail__amount']=df['price_detail__amount'].fillna(value=0.0)
df['price_detail__currency']=df['price_detail__currency'].fillna(value='INR')
df['price_detail__price_string']=df['price_detail__price_string'].fillna(value='₹0')

In [9]:
# Test
df[df['id'] == 2935720]

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__currency,discount_price__price_string,price_detail__amount,price_detail__currency,price_detail__price_string
13607,2935720,Acabou a Previdência e agora? - Volume 03,/course/acabou-a-previdencia-e-agora-volume-03/,True,0,0.0,0.0,0.0,0,False,14,0,2020-03-30T19:10:58Z,2020-04-02T16:33:32Z,0.0,INR,₹0,0.0,INR,₹0


In [10]:
# Drop the last row (the paid course with no recorded price)
idx = df[df['id'] == 2935720].index
df = df.drop(idx)
df.shape

(13607, 20)
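The cleaning cells above fill the defaults first and drop the inconsistent row afterwards. One possible variant reverses the order, so that the paid course without a price is never given a price of `0`. This is only a sketch on a made-up three-row frame: the `id` value `11` and the names `sample`, `invalid`, `defaults`, and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature frame: one priced course, one free course,
# and one paid course with no recorded price
sample = pd.DataFrame({
    "id": [762616, 11, 2935720],
    "is_paid": [True, False, True],
    "discount_price__amount": [455.0, None, None],
    "discount_price__currency": ["INR", None, None],
    "discount_price__price_string": ["₹455", None, None],
    "price_detail__amount": [8640.0, None, None],
    "price_detail__currency": ["INR", None, None],
    "price_detail__price_string": ["₹8,640", None, None],
})

# Step 1: drop paid courses that have no price (treated as data errors)
invalid = sample["is_paid"] & sample["price_detail__amount"].isna()
cleaned = sample[~invalid].copy()

# Step 2: fill the remaining nulls with per-column defaults in a single call
defaults = {
    "discount_price__amount": 0.0,
    "discount_price__currency": "INR",
    "discount_price__price_string": "₹0",
    "price_detail__amount": 0.0,
    "price_detail__currency": "INR",
    "price_detail__price_string": "₹0",
}
cleaned = cleaned.fillna(value=defaults)
print(cleaned)
```

Passing a dict to `fillna` keeps all defaults in one place, which makes it easier to see which value goes into which column.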

### Unique values

In [11]:
cols_uniques = {col: len(df[col].unique()) for col in cols}
cols_uniques

{'id': 13607,
 'title': 13562,
 'url': 13607,
 'is_paid': 2,
 'num_subscribers': 4875,
 'avg_rating': 1965,
 'avg_rating_recent': 11781,
 'rating': 11781,
 'num_reviews': 1285,
 'is_wishlisted': 1,
 'num_published_lectures': 301,
 'num_published_practice_tests': 7,
 'created': 13606,
 'published_time': 13604,
 'discount_price__amount': 54,
 'discount_price__currency': 1,
 'discount_price__price_string': 54,
 'price_detail__amount': 38,
 'price_detail__currency': 1,
 'price_detail__price_string': 38}
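The counts above show several columns with a single unique value (`is_wishlisted` and the two currency columns), which means they carry no information for the analysis. A short sketch of how such columns could be listed, using a made-up frame and illustrative variable names:

```python
import pandas as pd

# Hypothetical miniature frame with two constant columns
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "is_wishlisted": [False, False, False],
    "price_detail__currency": ["INR", "INR", "INR"],
    "num_reviews": [78006, 54581, 52653],
})

# nunique() counts distinct values per column; unlike len(col.unique()),
# it ignores NaN by default
unique_counts = sample.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)
```

Whether to drop these columns is a separate decision; the sketch only identifies them.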

### Data types

In [12]:
df.dtypes

id                                int64
title                            object
url                              object
is_paid                            bool
num_subscribers                   int64
avg_rating                      float64
avg_rating_recent               float64
rating                          float64
num_reviews                       int64
is_wishlisted                      bool
num_published_lectures            int64
num_published_practice_tests      int64
created                          object
published_time                   object
discount_price__amount          float64
discount_price__currency         object
discount_price__price_string     object
price_detail__amount            float64
price_detail__currency           object
price_detail__price_string       object
dtype: object
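The listing shows `created` and `published_time` as `object`, meaning the timestamps are still plain strings. Before they can serve as a time dimension they would need to be parsed. The sketch below uses two timestamp pairs copied from the preview of the first rows; the column `days_to_publish` is a hypothetical addition for illustration.

```python
import pandas as pd

# Two timestamp pairs taken from the first rows of the dataset
sample = pd.DataFrame({
    "created": ["2016-02-14T22:57:48Z", "2015-10-12T06:39:46Z"],
    "published_time": ["2016-04-06T05:16:11Z", "2016-01-11T21:39:33Z"],
})

# Parse the ISO 8601 strings into timezone-aware datetimes
for col in ["created", "published_time"]:
    sample[col] = pd.to_datetime(sample[col], utc=True)

# Hypothetical derived column: whole days between creation and publication
sample["days_to_publish"] = (sample["published_time"] - sample["created"]).dt.days

print(sample.dtypes)
print(sample["created"].dt.year.value_counts().sort_index())
```

Once parsed, the `.dt` accessor gives access to year, month, and similar components for grouping.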

### Value frequencies per column

In [13]:
for col in cols:
    print(df[col].value_counts())
    print()

762616     1
2798982    1
2492106    1
2446352    1
2508636    1
          ..
1161294    1
1355540    1
2295319    1
2291453    1
3211345    1
Name: id, Length: 13607, dtype: int64

Shopify Dropshipping - Scale to 7 figures with Clickfunnels!    2
Basics of Accounts and Finance made simple                      2
Fundamentals of Change Management                               2
Sales Fundamentals                                              2
Personal Finance 101                                            2
                                                               ..
Debit Spread For Half The Cost - Options Trading Reinvented     1
Freelance Mindset - Become an Unstoppable Freelance Force!      1
Forex Trading: A Simple Unique Approach For Massive Gains       1
Conversion Mastery: How to Optimize ANY Ecommerce Website       1
Poderoso Investidor                                             1
Name: title, Length: 13562, dtype: int64

/course/the-complete-sql-bootcamp/                

### Additional checks

Check whether any course has more reviews than subscribers.

In [14]:
subscriber_review_count_diff = df['num_reviews']>df['num_subscribers']
subscriber_review_count_diff.sum()

5

In [15]:
index = df[subscriber_review_count_diff].index
df = df.drop(index= index)
df.shape

(13602, 20)

In [16]:
df['id'].unique().size

13602

The number of unique IDs matches the number of rows, so no row is repeated.
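The same check can be written more directly, and it can be extended to columns that do repeat, such as `title`. A minimal sketch on a made-up frame (the `id` values are invented, the titles come from the frequency output above):

```python
import pandas as pd

# Hypothetical miniature frame: unique ids, one repeated title
sample = pd.DataFrame({
    "id": [101, 102, 103],
    "title": ["Sales Fundamentals", "Sales Fundamentals", "Personal Finance 101"],
})

# True when no id occurs more than once
ids_are_unique = sample["id"].is_unique

# keep=False marks every occurrence of a repeated title, not only the later ones
repeated_titles = sample[sample["title"].duplicated(keep=False)]

print(ids_are_unique)
print(repeated_titles)
```

A repeated title with a different `id` is not necessarily a duplicate row, so this only flags candidates for manual inspection.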

### Questions

1. [x] Is the dataset large enough?
It contains `13602` rows and `20` columns after cleaning, which is sufficient.

2. [x] Does the dataset contain sufficiently diverse data?
It has 20 columns with the data types `bool`, `float64`, `int64`, and `object` (strings and dates), so the data is sufficiently diverse.

3. [x] Does the dataset have a time dimension?
Yes: `created` and `published_time`.

4. [x] Does the dataset contain both quantitative and qualitative data?
The dataset has both qualitative (e.g. `is_paid`) and quantitative (e.g. `num_subscribers`) data.

5. [ ] Does the dataset have many missing values?
There were `5700` missing values in total out of `272140` cells (`2.09%`), and they have been handled, so every cell is now filled. The checks also found courses with more reviews than subscribers, which is not possible, so those rows were removed.
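For question 4, the split into quantitative and qualitative columns can be approximated from the data types. The sketch below uses a few values from the first rows of the dataset and treats every numeric column as quantitative, which is an assumption: a numeric identifier such as `id` would also land in that group and has to be excluded by hand.

```python
import pandas as pd

# A few columns with values from the first two rows of the dataset
sample = pd.DataFrame({
    "title": ["The Complete SQL Bootcamp 2020", "Tableau 2020 A-Z"],
    "is_paid": [True, True],
    "num_subscribers": [295509, 209070],
    "avg_rating": [4.66019, 4.58956],
})

# Integer and float columns count as quantitative; bool and string do not
quantitative = sample.select_dtypes(include="number").columns.tolist()
qualitative = sample.select_dtypes(exclude="number").columns.tolist()

print("quantitative:", quantitative)
print("qualitative:", qualitative)
```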
