
In [1]:
import numpy as np
import pandas as pd

In [2]:
users = pd.read_csv('users.csv', parse_dates=['created_at'])
repos = pd.read_csv('repositories.csv')

In [None]:
users.head()

In [4]:
repos.columns

Index(['login', 'full_name', 'created_at', 'stargazers_count',
       'watchers_count', 'language', 'has_projects', 'has_wiki',
       'license_name'],
      dtype='object')

In [5]:
users.columns

Index(['login', 'name', 'company', 'location', 'email', 'hireable', 'bio',
       'public_repos', 'followers', 'following', 'created_at'],
      dtype='object')

### 1. Who are the top 5 users in Mumbai with the highest number of followers? List their login in order, comma-separated.

In [6]:
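# argsort returns the positions that would sort the values in ascending order;
# reversing gives descending order, and [:5] keeps the five largest
idx = np.argsort(users['followers'].values)[::-1][:5]
# idx holds positions, so index with iloc rather than the label-based loc
users.iloc[idx]['login'].values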

array(['ValentineFernandes', 'kovidgoyal', 'slidenerd', 'aryashah2k',
       'coding-parrot'], dtype=object)
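
The same top-N selection can also be written with `DataFrame.nlargest`, which avoids the manual `argsort` and reversal. This is a minimal sketch on a hypothetical toy frame (`toy_users` and its values are invented for illustration, not taken from `users.csv`).

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads users.csv instead.
toy_users = pd.DataFrame({
    "login": ["a", "b", "c", "d"],
    "followers": [10, 300, 25, 120],
})

# nlargest orders rows by the column, descending, and keeps the first n.
top2 = toy_users.nlargest(2, "followers")["login"].tolist()

# Join the logins into the comma-separated form the question asks for.
print(", ".join(top2))
```

On the real data the equivalent call would presumably be `users.nlargest(5, 'followers')['login']`.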

### 2. Who are the 5 earliest registered GitHub users in Mumbai? List their login in ascending order of created_at, comma-separated.

In [7]:
users.sort_values('created_at').head(5)['login'].values

array(['ivank', 'sandeepshetty', 'svs', 'nitinhayaran', 'nischal'],
      dtype=object)

### 3. What are the 3 most popular licenses among these users? Ignore missing licenses. List the license_name in order, comma-separated.

In [8]:
repos['license_name'].value_counts().head(3).index.values

array(['mit', 'apache-2.0', 'other'], dtype=object)

### 4. Which company do the majority of these developers work at?

In [9]:
users['company'].value_counts().head(1).index[0]

'MASAI SCHOOL'

### 5. Which programming language is most popular among these users?

In [10]:
repos['language'].value_counts().head(1).index[0]

'JavaScript'

### 6. Which programming language is the second most popular among users who joined after 2020?

In [11]:
# Merge repos with users so each repository row carries its owner's join date
merged_df = repos.merge(users, how='left', on='login')
merged_df.head()

,login,full_name,created_at_x,stargazers_count,watchers_count,language,has_projects,has_wiki,license_name,name,company,location,email,hireable,bio,public_repos,followers,following,created_at_y
0,ValentineFernandes,ValentineFernandes/Age-Calculator-,2022-08-17T06:32:19Z,13,13,CSS,True,True,mit,Valentine Fernandes,,"Mumbai, India",,,HTML | CSS | JS | SQL | MYSQL | JAVA,66,5246,5275,2022-01-29 08:11:37+00:00
1,ValentineFernandes,ValentineFernandes/ASP.NET-,2022-04-26T10:12:11Z,18,18,ASP.NET,True,True,,Valentine Fernandes,,"Mumbai, India",,,HTML | CSS | JS | SQL | MYSQL | JAVA,66,5246,5275,2022-01-29 08:11:37+00:00
2,ValentineFernandes,ValentineFernandes/Assignment-4.2,2022-04-14T11:55:25Z,15,15,HTML,True,True,,Valentine Fernandes,,"Mumbai, India",,,HTML | CSS | JS | SQL | MYSQL | JAVA,66,5246,5275,2022-01-29 08:11:37+00:00
3,ValentineFernandes,ValentineFernandes/Bank-Management-System,2022-04-24T16:24:17Z,26,26,C,True,True,,Valentine Fernandes,,"Mumbai, India",,,HTML | CSS | JS | SQL | MYSQL | JAVA,66,5246,5275,2022-01-29 08:11:37+00:00
4,ValentineFernandes,ValentineFernandes/BMI-Calculator-Website,2022-08-17T04:47:27Z,11,11,HTML,True,True,mit,Valentine Fernandes,,"Mumbai, India",,,HTML | CSS | JS | SQL | MYSQL | JAVA,66,5246,5275,2022-01-29 08:11:37+00:00


In [12]:
# Filter the users who joined after 2020 and get the second most popular programming language
merged_df['created_at_y'] = pd.to_datetime(merged_df['created_at_y'])
merged_df[merged_df['created_at_y'].dt.year > 2020]['language'].value_counts().head(2).index[1]

'HTML'
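
Because both frames have a `created_at` column, the merge renames them `created_at_x` (repository) and `created_at_y` (user). A small sketch, assuming hypothetical toy frames, shows how the `suffixes` argument of `merge` can make that distinction explicit before filtering on the user's join year.

```python
import pandas as pd

# Hypothetical toy frames that mirror the overlapping created_at column.
toy_repos = pd.DataFrame({
    "login": ["a", "a", "b"],
    "language": ["Python", "HTML", "Go"],
    "created_at": ["2021-05-01", "2022-01-10", "2019-03-03"],
})
toy_users = pd.DataFrame({
    "login": ["a", "b"],
    "created_at": pd.to_datetime(["2021-02-01", "2018-07-15"]),
})

# suffixes replaces the default _x/_y with names that say which side a column came from.
merged = toy_repos.merge(toy_users, how="left", on="login",
                         suffixes=("_repo", "_user"))

# Keep repositories whose owner joined after 2020, then count languages.
recent = merged[merged["created_at_user"].dt.year > 2020]
language_counts = recent["language"].value_counts()
print(language_counts)
```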

### 7. Which language has the highest average number of stars per repository?

In [13]:
repos.groupby('language')['stargazers_count'].mean().sort_values(ascending=False).head(1).index[0]

'TSQL'

### 8. Let's define leader_strength as followers / (1 + following). Who are the top 5 in terms of leader_strength? List their login in order, comma-separated.

In [14]:
# Create a leader_strength column
users['leader_strength'] = users['followers'] / (1 + users['following'])
users.sort_values('leader_strength', ascending=False).head(5)['login'].values

array(['kovidgoyal', 'coding-parrot', 'gkcs', 'slidenerd', 'dmalvia'],
      dtype=object)

### 9. What is the correlation between the number of followers and the number of public repositories among users in Mumbai?

In [15]:
users['public_repos'].corr(users['followers'])

0.03461479920661342

### 10. Does creating more repos help users get more followers? Using regression, estimate how many additional followers a user gets per additional public repository.

In [16]:
from sklearn.linear_model import LinearRegression

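# X and y are used instead of input/output to avoid shadowing the built-in input()
X = users['public_repos'].values.reshape(-1, 1)
y = users['followers'].values.reshape(-1, 1)
model = LinearRegression()
model.fit(X, y)

# Slope: estimated additional followers per additional public repository
model.coef_[0][0]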

0.10108026946686971
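
For a single predictor, the least-squares slope equals cov(x, y) / var(x), which is also r · (std of y / std of x). The sketch below checks that identity with NumPy on invented numbers (`repo_counts` and `follower_counts` are hypothetical, not the notebook's data).

```python
import numpy as np

# Hypothetical values for illustration only.
repo_counts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
follower_counts = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Degree-1 polyfit returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(repo_counts, follower_counts, deg=1)

# The same slope from the sample covariance and variance.
cov_xy = np.cov(repo_counts, follower_counts, ddof=1)[0, 1]
slope_from_cov = cov_xy / np.var(repo_counts, ddof=1)

# And from Pearson's r scaled by the ratio of standard deviations.
r = np.corrcoef(repo_counts, follower_counts)[0, 1]
slope_from_r = r * np.std(follower_counts, ddof=1) / np.std(repo_counts, ddof=1)

print(slope, slope_from_cov, slope_from_r)
```

This is why a weak correlation, as in question 9, can still come with a non-trivial slope when followers vary far more than repository counts.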

### 11. Do people typically enable projects and wikis together? What is the correlation between a repo having projects enabled and having wiki enabled?

In [28]:
from scipy.stats import chi2_contingency

def phi_coefficient(x, y):
    """Return sqrt(chi2 / n) for two categorical series (the magnitude of phi)."""
    contingency_table = pd.crosstab(x, y)
    # chi2_contingency applies Yates' continuity correction to 2x2 tables by default,
    # so this approximates |phi|; the sign of the association is not recovered
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    phi = np.sqrt(chi2 / n)
    return phi

# Calculate Phi Coefficient
phi_result = phi_coefficient(repos['has_projects'], repos['has_wiki'])
print("Phi Coefficient:", phi_result)

Phi Coefficient: 0.16082282399657025
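
For two boolean columns, the phi coefficient is the same number as the Pearson correlation of the columns cast to 0/1. The sketch below verifies this on a hypothetical toy frame by computing phi directly from the 2x2 counts, without a continuity correction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "has_projects": [True, True, True, False, False, True, False, True],
    "has_wiki":     [True, True, False, False, False, True, True, True],
})

table = pd.crosstab(toy["has_projects"], toy["has_wiki"])
n11 = table.loc[True, True]    # both enabled
n10 = table.loc[True, False]   # projects only
n01 = table.loc[False, True]   # wiki only
n00 = table.loc[False, False]  # neither

# Signed phi from the 2x2 counts and their row/column totals.
denominator = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi = (n11 * n00 - n10 * n01) / denominator

# Pearson correlation of the same columns as 0/1 integers.
pearson = toy["has_projects"].astype(int).corr(toy["has_wiki"].astype(int))

print(phi, pearson)
```

On the notebook's data, `repos['has_projects'].astype(int).corr(repos['has_wiki'].astype(int))` would be the corresponding one-liner, assuming both columns are fully populated booleans.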


### 12. Do hireable users follow more people than those who are not hireable?

In [18]:
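# These averages are over 'following' (accounts the user follows), not followers.
# groupby drops missing hireable values, so .values[0] is the mean of the first non-missing group
avg_following_hireable_users = users.groupby('hireable')['following'].mean().values[0]
# Users with a missing hireable flag are treated as not hireable
avg_following_non_hireable_users = users[users['hireable'].isna()]['following'].mean()

diff = avg_following_hireable_users - avg_following_non_hireable_users
diff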

10.073181216931204

### 13. Some developers write long bios. Does that help them get more followers? What's the correlation of the length of their bio (in Unicode characters) with followers? (Ignore people without bios)

In [36]:
# Copy the filtered frame so the new column is not written to a slice of users
users_with_bios = users[users['bio'].notnull()].copy()
# Note: this counts whitespace-separated words; the question asks for the length in
# Unicode characters, which would be users_with_bios['bio'].str.len()
users_with_bios['bio_length'] = users_with_bios['bio'].apply(lambda x: len(x.split()))

In [41]:
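X = users_with_bios[['bio_length']]
y = users_with_bios[['followers']]

# This is the regression slope of followers on bio_length, not a correlation coefficient;
# users_with_bios['bio_length'].corr(users_with_bios['followers']) would give Pearson's r
reg = LinearRegression().fit(X, y)
reg.coef_[0][0]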

-0.18651238550240148
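
Character length and word count are different measures of a bio, and the question asks for the former. A minimal sketch on a hypothetical toy frame contrasts the two and computes the Pearson correlation with followers on the character-based length.

```python
import pandas as pd

# Hypothetical toy data; "\u00d6" is a single non-ASCII character (capital O with diaeresis).
toy = pd.DataFrame({
    "bio": ["Data scientist", None, "Dev", "\u00d6pen source"],
    "followers": [10, 5, 3, 40],
})

# Drop users without a bio and copy, so new columns are added to an independent frame.
with_bio = toy.dropna(subset=["bio"]).copy()

# Length in Unicode characters (code points), as the question specifies.
with_bio["bio_chars"] = with_bio["bio"].str.len()
# Number of whitespace-separated words, shown only for contrast.
with_bio["bio_words"] = with_bio["bio"].str.split().str.len()

# Pearson correlation between character length and followers.
r = with_bio["bio_chars"].corr(with_bio["followers"])
print(with_bio[["bio_chars", "bio_words"]], r)
```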

### 14. Who created the most repositories on weekends (UTC)? List the top 5 users' login in order, comma-separated

In [21]:
# Convert created_at to pandas datetime
repos['created_at'] = pd.to_datetime(repos['created_at'])

In [22]:
repos_weekend = repos[repos['created_at'].dt.dayofweek.isin([5, 6])]
repos_weekend.groupby('login')['login'].count().sort_values(ascending=False).head(5).index.values

array(['Kushal334', 'alokproc', 'patilswapnilv', 'rajeshpillai',
       'deadcoder0904'], dtype=object)
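
In pandas, `dt.dayofweek` numbers Monday as 0 through Sunday as 6, so 5 and 6 select Saturday and Sunday. The sketch below illustrates the weekend filter on hypothetical UTC timestamps (the logins and dates are invented).

```python
import pandas as pd

# Hypothetical toy data; the trailing Z marks the timestamps as UTC.
toy = pd.DataFrame({
    "login": ["a", "a", "b", "b", "c"],
    "created_at": pd.to_datetime([
        "2024-01-06T10:00:00Z",  # Saturday
        "2024-01-07T23:30:00Z",  # Sunday
        "2024-01-08T00:10:00Z",  # Monday
        "2024-01-13T12:00:00Z",  # Saturday
        "2024-01-10T09:00:00Z",  # Wednesday
    ]),
})

# Keep rows created on Saturday (5) or Sunday (6).
weekend = toy[toy["created_at"].dt.dayofweek.isin([5, 6])]

# Count weekend repositories per user, most active first.
weekend_counts = weekend.groupby("login").size().sort_values(ascending=False)
print(weekend_counts)
```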

### 15. Do people who are hireable share their email addresses more often?

In [45]:
frac1 = ((users['hireable'].notnull()) & (users['email'].notnull())).sum() / (users['hireable'].notnull()).sum()
frac2 = ((users['hireable'].isnull()) & (users['email'].notnull())).sum() / (users['hireable'].isnull()).sum()

frac1 - frac2

0.21906415343915348

### 16. Let's assume that the last word in a user's name is their surname (ignore missing names, trim and split by whitespace.) What's the most common surname? (If there's a tie, list them all, comma-separated, alphabetically)

In [47]:
def get_surname(x):
    # Trim and split on whitespace; the last word is taken as the surname
    name = x.split()
    # Empty or single-word names are treated as having no surname
    if len(name) <= 1:
        return None

    return name[-1]


# Copy the filtered frame so the new column is not written to a slice of users
users_with_name = users[users['name'].notnull()].copy()
users_with_name['surname'] = users_with_name['name'].apply(get_surname)
users_with_name['surname'].value_counts()

surname,count
Singh,17
Shah,15
Yadav,13
Shaikh,11
Patil,10
...,...
D'silva,1
Raghuvanshi,1
Crew,1
LTD,1
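
The question also asks for all tied surnames, listed alphabetically. A minimal sketch of that tie handling follows, using hypothetical names. Unlike `get_surname` above, this version counts a single-word name as its own surname, which is an assumption about how the question should be read.

```python
import pandas as pd

# Hypothetical names, including extra whitespace and a missing value.
names = pd.Series(["Asha Shah", " Ravi  Singh ", None,
                   "Meera Singh", "Kiran Shah", "Madonna"])

# Ignore missing names, trim, split on whitespace, and take the last word.
surnames = names.dropna().str.strip().str.split().str[-1]

counts = surnames.value_counts()

# Keep every surname that reaches the maximum count, sorted alphabetically.
most_common = sorted(counts[counts == counts.max()].index)
print(", ".join(most_common))
```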
