In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('final_data.csv', index_col=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1500 entries, 0 to 1499
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   repo_name       1500 non-null   object
 1   star            1500 non-null   object
 2   fork            1500 non-null   object
 3   watch           1500 non-null   object
 4   issue           1500 non-null   object
 5   tags            1500 non-null   object
 6   description     1480 non-null   object
 7   contributers    1500 non-null   object
 8   license         1500 non-null   object
 9   repo_url        1500 non-null   object
 10  most_used_lang  1297 non-null   object
dtypes: object(11)
memory usage: 140.6+ KB


In [3]:
data.head()

Unnamed: 0,repo_name,star,fork,watch,issue,tags,description,contributers,license,repo_url,most_used_lang
0,keras,47.9k,18.1k,2.1k,2940,"['deep-learning', 'tensorflow', 'neural-networ...",Deep Learning for humans,49,View license,https://github.com/keras-team/keras,Python
1,scikit-learn,40.3k,19.6k,2.2k,1505,"['machine-learning', 'python', 'statistics', '...",scikit-learn: machine learning in Python,108,View license,https://github.com/scikit-learn/scikit-learn,Python
2,PythonDataScienceHandbook,23.1k,9.9k,1.5k,65,"['scikit-learn', 'numpy', 'python', 'jupyter-n...",Python Data Science Handbook: full text in Jup...,0,View license,https://github.com/jakevdp/PythonDataScienceHa...,Jupyter Notebook
3,Probabilistic-Programming-and-Bayesian-Methods...,21k,6.6k,1.4k,127,"['bayesian-methods', 'pymc', 'mathematical-ana...","aka ""Bayesian Methods for Hackers"": An introdu...",0,MIT,https://github.com/CamDavidsonPilon/Probabilis...,Jupyter Notebook
4,Data-Science--Cheat-Sheet,18.4k,8.2k,1.5k,7,[],Cheat Sheets,0,Fetching contributors,https://github.com/abhat222/Data-Science--Chea...,


## Cleaning

The `info()` output shows that some rows have null values in the `description` and `most_used_lang` columns. I think the most-used language affects a repo's popularity because, as a rule, projects built with trending technologies are the ones people view most. So I think we should remove the rows where `most_used_lang` is null. Let's see what happens. 🙂

In [4]:
data = data.dropna(subset=['most_used_lang']).copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1297 entries, 0 to 1499
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   repo_name       1297 non-null   object
 1   star            1297 non-null   object
 2   fork            1297 non-null   object
 3   watch           1297 non-null   object
 4   issue           1297 non-null   object
 5   tags            1297 non-null   object
 6   description     1278 non-null   object
 7   contributers    1297 non-null   object
 8   license         1297 non-null   object
 9   repo_url        1297 non-null   object
 10  most_used_lang  1297 non-null   object
dtypes: object(11)
memory usage: 121.6+ KB
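
One way to confirm what the drop did is to compare missing-value counts before and after it. The sketch below is only an illustration: it uses a small hypothetical frame (`toy`) in place of `final_data.csv`, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "repo_name": ["a", "b", "c", "d"],
    "description": ["first", np.nan, "third", np.nan],
    "most_used_lang": ["Python", "C++", np.nan, np.nan],
})

# Missing values per column before cleaning.
missing_before = toy.isna().sum()

# Drop only the rows where most_used_lang is missing.
cleaned = toy.dropna(subset=["most_used_lang"]).copy()
missing_after = cleaned.isna().sum()

rows_removed = len(toy) - len(cleaned)
print(missing_before)
print(missing_after)
print(f"rows removed: {rows_removed}")
```

`dropna` keeps the original index labels, which is why the real frame still reports `0 to 1499` with only 1297 entries. If a contiguous index is needed later, `reset_index(drop=True)` is one option.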


After removing the rows with a null `most_used_lang`, the `description` column still has some null values (19 of the 1297 remaining rows). A description arguably makes a repo more attractive and easier for others to understand, so it may affect popularity as well.

In [5]:
print(data[data['description'].isnull()][['most_used_lang', 'repo_url', 'star']])

     most_used_lang                                           repo_url   star
202             C++  https://github.com/oreillymedia/Learning-OpenC...   1.4k
214          Python  https://github.com/MicrocontrollersAndMore/Ope...    405
219            Java                   https://github.com/xikuqi/OpenCV    317
234             C++  https://github.com/MicrocontrollersAndMore/Ope...    217
242             C++         https://github.com/saki4510t/OpenCVwithUVC    154
245          Python  https://github.com/MicrocontrollersAndMore/Ras...    143
253          Python  https://github.com/MicrocontrollersAndMore/Ope...    128
269             C++  https://github.com/MicrocontrollersAndMore/Ope...    105
291             C++  https://github.com/MicrocontrollersAndMore/Ope...     74
356             CSS  https://github.com/handong1587/handong1587.git...     3k
428          Python   https://github.com/martinarjovsky/WassersteinGAN   2.6k
449          Python            https://github.com/hanzhanggit/St

The output above shows that repos without a description can still be popular. So, rather than removing them, I will replace the missing descriptions with `''`.
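
The popularity claim could also be checked numerically, but the `star` column holds strings such as `1.4k`. The sketch below is a rough illustration under stated assumptions: `parse_count` is a hypothetical helper, it assumes `k` is the only suffix that appears, and the `toy` frame holds only a handful of star values copied from the outputs above.

```python
import pandas as pd

def parse_count(text):
    """Convert strings such as '1.4k' or '405' to a float (hypothetical helper)."""
    text = str(text).strip().lower().replace(",", "")
    if text.endswith("k"):
        # Assumption: 'k' means thousands; other suffixes are not handled.
        return float(text[:-1]) * 1000
    return float(text)

# Illustrative rows only; None marks a repository without a description.
toy = pd.DataFrame({
    "description": ["Deep Learning for humans", None, None, "Cheat Sheets"],
    "star": ["47.9k", "1.4k", "405", "18.4k"],
})

toy["star_num"] = toy["star"].map(parse_count)
toy["has_description"] = toy["description"].notna()

# Median stars for repositories with and without a description.
print(toy.groupby("has_description")["star_num"].median())
```

On the real data, the same grouping would be applied to `data` before the descriptions are filled, since `notna()` can no longer tell the two groups apart afterwards.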

In [6]:
data['description'] = data['description'].fillna('')
len(data[data['description'].isnull()])

0
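
A small sketch of why an empty string is a convenient fill value: string methods then return a result for every row, and a missing description can still be recognised by its zero length. The series `desc` below is hypothetical and only mimics the `description` column.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptions; NaN marks a repository without one.
desc = pd.Series(
    ["Deep Learning for humans", np.nan, "Cheat Sheets"],
    name="description",
)

filled = desc.fillna("")

# String methods now return a value for every row instead of NaN.
lengths = filled.str.len()
has_text = lengths > 0

print(filled.isna().sum())
print(lengths.tolist())
print(has_text.tolist())
```

Note that after filling, `isnull()` no longer flags these rows, so any later check for "no description" has to test for the empty string instead.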

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1297 entries, 0 to 1499
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   repo_name       1297 non-null   object
 1   star            1297 non-null   object
 2   fork            1297 non-null   object
 3   watch           1297 non-null   object
 4   issue           1297 non-null   object
 5   tags            1297 non-null   object
 6   description     1297 non-null   object
 7   contributers    1297 non-null   object
 8   license         1297 non-null   object
 9   repo_url        1297 non-null   object
 10  most_used_lang  1297 non-null   object
dtypes: object(11)
memory usage: 121.6+ KB


## Visualization