### __Examining variable types in Pandas__

In [7]:
# import pandas
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

In [8]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'youtube'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

youtube_df = pd.read_sql_query('select * from youtube',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

To get a high-level overview of the DataFrame, we call the .info() method, which reports the number of rows, the non-null count of each column, and the data type of each column.

In [9]:
youtube_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Rank           5000 non-null   object
 1   Grade          5000 non-null   object
 2   Channel name   5000 non-null   object
 3   Video Uploads  5000 non-null   object
 4   Subscribers    5000 non-null   object
 5   Video views    5000 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


The dataset contains 5000 observations and 6 columns. Only *Video views* is stored as an integer (int64). The other five columns have the 'object' dtype, which in Pandas usually means they hold strings.
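To check what an 'object' column really holds, one option is to count the Python types of its values. The snippet below is a minimal sketch on a small stand-in Series; the name `subscribers` and its values are illustrative assumptions, not the real query result.

```python
import pandas as pd

# Illustrative stand-in for youtube_df["Subscribers"] (assumed values),
# stored with the generic object dtype
subscribers = pd.Series(["18752951", "--", "19238251"], dtype=object)

# Count the Python type name of every value in the column
type_counts = subscribers.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

If every value is reported as str, the column is text throughout, even where the text looks like a number.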

To view the first few rows of the DataFrame, we use the .head() method.

In [10]:
# print first rows of the data frame
youtube_df.head()

Unnamed: 0,Rank,Grade,Channel name,Video Uploads,Subscribers,Video views
0,1st,A++,Zee TV,82757,18752951,20869786591
1,2nd,A++,T-Series,12661,61196302,47548839843
2,3rd,A++,Cocomelon - Nursery Rhymes,373,19238251,9793305082
3,4th,A++,SET India,27323,31180559,22675948293
4,5th,A++,WWE,36756,32852346,26273668433


*Video Uploads* and *Subscribers* look like integers as well, yet Pandas stores them as objects.

Let's take a closer look to find out why.

In [11]:
youtube_df[(youtube_df["Video Uploads"].str.strip() == "--") | (youtube_df["Subscribers"].str.strip() == "--")]

Unnamed: 0,Rank,Grade,Channel name,Video Uploads,Subscribers,Video views
17,18th,A+,Vlad and Nikita,53,--,1428274554
108,109th,A,BIGFUN,373,--,941376171
115,116th,A,Bee Kids Games - Children TV,740,--,414535723
142,143rd,A,ChiChi TV Siêu Nhân,421,--,2600394871
143,144th,A,MusicTalentNow,1487,--,3252752212
...,...,...,...,...,...,...
4941,"4,942nd",B+,GMTV,183,--,127080542
4948,"4,949th",B+,Keivon ToysReview,468,--,481568513
4956,"4,957th",B+,CLICKNEWS,2661,--,139940815
4961,"4,962nd",B+,ONE Championship,905,--,109836654


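The result contains 390 rows in which either *Video Uploads* or *Subscribers* holds the placeholder -- instead of a number. Because of these non-numeric entries, Pandas reads both columns as strings rather than integers.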
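One common way to handle such placeholders is to coerce them to missing values with pd.to_numeric. The snippet below is a minimal sketch on a toy frame whose rows are illustrative assumptions shaped like the output above; it is not necessarily the cleaning step used later in this notebook.

```python
import pandas as pd

# Toy rows shaped like the output above (illustrative values)
toy_df = pd.DataFrame({
    "Channel name": ["Zee TV", "Vlad and Nikita", "BIGFUN"],
    "Video Uploads": ["82757", "53", "373"],
    "Subscribers": ["18752951", "--", "-- "],
})

for col in ["Video Uploads", "Subscribers"]:
    # Strip whitespace, then turn anything non-numeric (such as "--") into NaN
    toy_df[col] = pd.to_numeric(toy_df[col].str.strip(), errors="coerce")

# Number of rows where either column is now missing
n_missing = toy_df[["Video Uploads", "Subscribers"]].isna().any(axis=1).sum()
print(toy_df.dtypes)
print(n_missing)
```

A column that ends up containing NaN is stored as float64, because NaN is a floating-point value; a column without missing values becomes int64.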

In [12]:
# returns the number of unique values for each column
youtube_df.nunique()

Rank             5000
Grade               6
Channel name     4993
Video Uploads    2286
Subscribers      4612
Video views      5000
dtype: int64

*Grade* has only 6 distinct values, so it is safe to treat it as a categorical variable.

*Channel name* has nearly 5000 unique values (4993). Is it categorical? Each channel name is a label rather than a quantity, and the set of values it can take is limited to the existing YouTube channels, so it is a categorical variable, although one with very high cardinality.

*Rank* can be treated either as an ordinal categorical variable or as an interval continuous variable.
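A variable such as *Grade* can be stored with Pandas' ordered categorical dtype so that comparisons respect the ranking of the labels. The snippet below is a minimal sketch: the Series `grades` and the list `grade_order` are illustrative assumptions that cover only the four labels visible in the outputs above, not all six grades in the dataset.

```python
import pandas as pd

# Illustrative grades; only labels visible in the outputs above are used
grades = pd.Series(["A++", "A+", "A", "B+", "A"])

# Assumed ordering from lowest to highest grade
grade_order = ["B+", "A", "A+", "A++"]
grade_cat = pd.Series(
    pd.Categorical(grades, categories=grade_order, ordered=True)
)

# With an ordered categorical, comparisons and min/max follow grade_order
above_a = (grade_cat > "A").sum()
print(grade_cat.min(), grade_cat.max(), above_a)
```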

### __Changing variable types__

We'll sometimes want to work with a categorical variable instead of a continuous one. To do so, we can transform a continuous variable into an ordinal categorical variable by grouping its values into ranges.

In [13]:
# Create a new feature in our DataFrame, views_group.

# Map a video-view count to an ordinal group:
# 1 = at least 1 billion views, 2 = at least 100 million views, 3 = fewer
def categorize_video_views(views_num):
    if views_num >= 1_000_000_000:
        return 1
    elif views_num >= 100_000_000:
        return 2
    else:
        return 3

# apply the function above to every value in the column with Pandas' .apply()
youtube_df['views_group'] = youtube_df['Video views'].apply(categorize_video_views)

# let's see how many observations we have in each group
print(youtube_df.groupby('views_group')["Video views"].count())

views_group
1    1399
2    2846
3     755
Name: Video views, dtype: int64
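The same binning can also be expressed with pd.cut, which assigns each value to an interval without a hand-written function. The snippet below is a sketch of that alternative, not the approach used in the cell above; the Series `views` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative view counts (assumed values, including both thresholds)
views = pd.Series(
    [50_000_000, 100_000_000, 999_999_999, 1_000_000_000, 20_869_786_591],
    name="Video views",
)

# Left-closed intervals: [0, 100M), [100M, 1B), [1B, inf)
bins = [0, 100_000_000, 1_000_000_000, np.inf]
labels = [3, 2, 1]  # same group numbers as categorize_video_views

views_group = pd.cut(views, bins=bins, labels=labels, right=False)
group_list = views_group.astype(int).tolist()
print(group_list)
```

Setting right=False makes each interval include its lower bound, which matches the >= comparisons in categorize_video_views. The result is an ordered categorical rather than a plain integer column.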
