## Arch User Repository Data

I'm interested in exploring the metadata of the Arch User Repository (AUR). This notebook shows how I scraped data from [https://www.archlinux.org/packages/](https://www.archlinux.org/packages/) and then walks through some data analysis and visualization. 

### Scraping data

From the AUR we have the following stats: 

Value | Count
---|---
Packages | 42909
Orphan Packages | 2566
Packages added in the past 7 days | 158
Packages updated in the past 7 days | 1252
Packages updated in the past year | 16338
Packages never updated | 9876
Registered Users | 48850
Trusted Users | 46

When I started scraping, the AUR package search reported: 45557 packages found. Page 1 of 183. 250 results per page.

In addition to the Arch User Repository, we can also easily gather data for regular Arch Linux packages that are part of the core, community, multilib and extra categories. There are just under 10,000 non-AUR packages that are a core part of Arch Linux. 


In [1]:
from selenium import webdriver
import re
import time
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import os
import requests

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import networkx

## Correction

The AUR publishes a list of its package names, so it is not necessary to scrape the names from the search pages. The list is available here: 

[https://aur.archlinux.org/packages.gz](https://aur.archlinux.org/packages.gz). 

First we will loop through the 184 pages that list packages. This will give us a series of HTML files that each contain up to 250 packages, with the following information for each: package name (and link), version, votes, popularity, description and maintainer. 


In [12]:
# os.chdir('../html/pages/')
# driver = webdriver.PhantomJS()
# base_url = "https://aur.archlinux.org/packages/?SeB=nd&K=&outdated=&SB=n&SO=a&PP=250&do_Search=Go&O="
# for i in range(0,184):
#     driver.get(base_url+str(i*250))
#     time.sleep(3 + np.random.random())
#     html = driver.page_source.encode('utf-8')
#     name = "page_" + str(i)
#     package_list = open(name+'.txt', 'w+')
#     package_list.write(str(html))
#     package_list.close()
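
As a rough check on the page count, the offsets passed to the `O=` parameter can be derived from the result total. This is a minimal sketch that assumes the 45557 results and 250 results per page quoted above; `page_offsets` is a hypothetical helper and not part of the original scraping code.

```python
import math

BASE_URL = ("https://aur.archlinux.org/packages/"
            "?SeB=nd&K=&outdated=&SB=n&SO=a&PP=250&do_Search=Go&O=")

def page_offsets(total_results, per_page=250):
    """Return the starting offset of every results page (hypothetical helper)."""
    n_pages = math.ceil(total_results / per_page)
    return [page * per_page for page in range(n_pages)]

offsets = page_offsets(45557)                       # assumed total from the search page
urls = [BASE_URL + str(offset) for offset in offsets]
print(len(offsets), offsets[-1])
```

With these numbers the helper yields 183 offsets, the last being 45500, so a loop over `range(0, 184)` requests one offset beyond the last one computed here.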

In [None]:
# os.chdir('../html/pages/')
# files = os.listdir()
# dict_list = []
# for file in files:     
#     f = open(file, 'r')
#     html = f.read()
#     b = BeautifulSoup(html, 'lxml')
#     try: 
#         packages = b.find_all('tr')[1:]
#         for package in packages:
#             data = package.find_all('td')
#             data_dict = {
#                          "name": data[0].find('a').text,
#                          "link": data[0].find('a')['href'],
#                          "version":data[1].text,
#                          "votes": int(data[2].text),
#                          "popularity": float(data[3].text), 
#                          "description": data[4].text, 
#                          "user": data[5].text.strip('\\n').strip('\\t').strip('\\n')
#                         }
#             dict_list.append(data_dict)
#         print(f)
#     except Exception as e:
#         print(e)
#     b.decompose()
#     f.close()

In [59]:
cols = ['name', 'link', 'version', 'votes', 'popularity', 'description', 'user']
df = pd.DataFrame(dict_list,columns=cols)
df = df.drop_duplicates()
df.to_csv('../csv/aur_data.csv')

(45558, 7)

In [3]:
df = pd.read_csv('../csv/aur_data.csv', index_col=0)
df.shape  # (45558, 7)

This looks good! We have 45558 packages from the AUR, just one more than the 45557 that were listed when we started scraping, and 9983 packages from Arch Linux. There is a lot of interesting data right here, but each package's page in the AUR holds more: related packages, comments, contributors and release dates. We can visit each package page to scrape this additional data. Here's an example of all the data we have about an individual package: 

## Individual Packages

Let's take [Spotify](https://aur.archlinux.org/packages/spotify/) as an example of how the package metadata is structured: 

#### Package Details: spotify 1.0.66.478-1
Attribute | Value
---|---
Git Clone URL: | https://aur.archlinux.org/spotify.git (read-only)
Package Base: | spotify
Description: |A proprietary music streaming service
Upstream URL: | http://www.spotify.com
Licenses: | custom:"Copyright (c) 2006-2010 Spotify Ltd"
Submitter: | gadget3000
Maintainer: | AWhetter
Last Packager: | AWhetter
Votes: | 1268
Popularity: | 49.768907
First Submitted: | 2010-07-12 09:17
Last Updated: | 2017-10-29 16:29


Dependencies (15) | Required by (5)
---|---
desktop-file-utils (desktop-file-utils-git) | blockify
gconf (gconf-gtk2) | blockify-git 
glib2 (glib2-git, glib2-patched-thumbnailer, glib2-quiet, glib2-sched-policy) | spotify-adkiller-dns-block-git
gtk2 (gtk2-patched-filechooser-icon-view, gtk2-patched-gdkwin-nullcheck, gtk2-ubuntu) | spotify-adkiller-git
libcurl-compat (libcurl-compat-nostatic) | 
libsystemd (eudev-git, libeudev-systemd, libsystemd-eudev-standalone, libsystemd-git, libsystemd-selinux) | 
libx11 (libx11-nokeyboardgrab) | 
libxss | 
libxtst | 
nss (nss-hg) | 
openssl-1.0 (openssl-1.0-chacha20) | 
rtmpdump (rtmpdump-git, rtmpdump-ksv-git) | 
alsa-lib>=1.0.14 | 
ffmpeg0.10 (optional) – Adds support for playback of local files | 
zenity (qarma-git, zenity-gtk2) (optional) – Adds support for importing local files | 

No. | Sources (5)
--- | ---
(1) | http://repository.spotify.com/pool/non-free/s/spotify-client/spotify-client_1.0.66.478.g1296534d-39_amd64.deb (x86_64) 
(2) | http://repository.spotify.com/pool/non-free/s/spotify-client/spotify-client_1.0.66.478.g1296534d-39_i386.deb (i686) 
(3) | LICENSE 
(4) | spotify
(5) | spotify.protocol


#### Pinned Comments

NicoHood commented on 2017-05-28 11:45

>@Lenovsky There you go. Please upvote this topic if you wish to have spotify in the official ArchLinux [community] repository.
> 
> https://community.spotify.com/t5/Desktop-Linux-Windows-Web-Player/Redistribute-Spotify-on-Linux-Distributions/m-p/1695334#M188735

#### Latest Comments

skiwithuge commented on 2017-11-23 11:27

> 2017 11 23: need to change spotify version to spotify-client_1.0.67.582.g19436fa3-28_amd64.deb 
>
>http://repository.spotify.com/pool/non-free/s/spotify-client/

[...]

Not all packages list the same information. The spotify package doesn't list an architecture, a download size or an install size like other packages do. Some packages have multiple maintainers. Let's write a script that tries to capture the available fields into a dictionary and then create a new DataFrame from that data. 
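
One way to cope with fields that vary between packages is to start from a fixed set of keys and fill in whatever a page provides. The sketch below is only illustrative: `EXPECTED_FIELDS` and `rows_to_record` are hypothetical names, the field list is copied from the spotify example, and it assumes the label/value pairs have already been extracted from the HTML. Treating a repeated label as a list (for example several maintainers) is also an assumption about how such pages would be parsed.

```python
# Hypothetical field list, taken from the labels in the spotify example.
EXPECTED_FIELDS = [
    "Package Base", "Description", "Upstream URL", "Licenses", "Submitter",
    "Maintainer", "Last Packager", "Votes", "Popularity",
    "First Submitted", "Last Updated",
]

def rows_to_record(rows, expected=EXPECTED_FIELDS):
    """Turn (label, value) pairs into a dict; absent fields stay None."""
    record = {field: None for field in expected}
    for label, value in rows:
        key = label.strip().rstrip(":")
        value = value.strip()
        if record.get(key) is None:
            record[key] = value
        elif isinstance(record[key], list):
            record[key].append(value)
        else:
            # A repeated label is collected into a list of values.
            record[key] = [record[key], value]
    return record

rows = [("Package Base:", "spotify"), ("Maintainer:", "AWhetter"), ("Votes:", "1268")]
record = rows_to_record(rows)
```

Because every record has the same keys, a list of such dictionaries can be passed to `pd.DataFrame` and the missing values show up as empty cells.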

We may want to build a separate DataFrame for comments with user, date, comment, pinned/not pinned. 
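
A possible shape for those comment rows is sketched here. It assumes the comment header text has the form shown in the spotify example (`<user> commented on <date>`) and that the header and body have already been pulled out of the page; `comment_row` is a hypothetical helper, and the comment bodies are shortened.

```python
import re
from datetime import datetime

# Header format assumed from the example: "NicoHood commented on 2017-05-28 11:45"
HEADER_RE = re.compile(
    r"^(?P<user>\S+) commented on (?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2})$"
)

def comment_row(header, body, pinned=False):
    """Build one row (user, date, comment, pinned) from a header and body."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised comment header: {header!r}")
    return {
        "user": match.group("user"),
        "date": datetime.strptime(match.group("date"), "%Y-%m-%d %H:%M"),
        "comment": body.strip(),
        "pinned": pinned,
    }

rows = [
    comment_row("NicoHood commented on 2017-05-28 11:45",
                "@Lenovsky There you go.", pinned=True),
    comment_row("skiwithuge commented on 2017-11-23 11:27",
                "need to change spotify version"),
]
```

A list of rows like this can be turned into the comments DataFrame with `pd.DataFrame(rows)`.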

For dependency relationships we can simply store a list of each and then use a graph package like NetworkX to analyze how the packages are related.
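
Before any graph analysis, the dependency entries need to be reduced to plain package names. The following sketch uses only the standard library and assumes entries look like the ones in the spotify table (alternatives in parentheses, version constraints such as `>=1.0.14`, and `(optional)` notes). The helper names are hypothetical and the sample data is a partial, illustrative subset.

```python
import re
from collections import defaultdict

def dependency_name(entry):
    """Reduce 'alsa-lib>=1.0.14' or 'gconf (gconf-gtk2)' to the package name."""
    first_token = entry.strip().split()[0]
    return re.split(r"[<>=]", first_token, maxsplit=1)[0]

def build_edges(dependencies):
    """Map {package: [dependency entries]} to (package, dependency) edges."""
    return [(pkg, dependency_name(dep))
            for pkg, deps in dependencies.items()
            for dep in deps]

def required_by(edges):
    """Invert the edges: for each dependency, the packages that need it."""
    reverse = defaultdict(list)
    for pkg, dep in edges:
        reverse[dep].append(pkg)
    return dict(reverse)

deps = {
    "spotify": ["gconf (gconf-gtk2)", "libxss", "alsa-lib>=1.0.14"],
    "blockify": ["spotify"],
}
edges = build_edges(deps)
reverse = required_by(edges)
```

An edge list in this form can be handed to `networkx.DiGraph(edges)` to get a directed dependency graph.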

Also, it will take some time to scrape all of this package data, so we should write our script in a way that can easily pick up from where it last stopped if we lose our connection during execution. 
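
One simple way to make the scrape resumable is to treat the saved files themselves as the record of progress. This is a sketch under assumptions: one `.txt` file per package in a single output directory, and a caller-supplied `fetch` function that returns the page text. `pending` and `scrape_all` are hypothetical names, and the demo uses a fake `fetch` so that it runs without network access.

```python
import os
import tempfile

def pending(names, out_dir, suffix=".txt"):
    """Return the names that do not yet have a saved file in out_dir."""
    done = set(os.listdir(out_dir))
    return [name for name in names if name + suffix not in done]

def scrape_all(names, out_dir, fetch):
    """Fetch and save every pending package; safe to re-run after a failure."""
    for name in pending(names, out_dir):
        html = fetch(name)  # caller-supplied function returning the page text
        tmp_path = os.path.join(out_dir, name + ".part")
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(html)
        # Rename only after a complete write, so a partial file never counts as done.
        os.replace(tmp_path, os.path.join(out_dir, name + ".txt"))

# Demo with a fake fetch function and a temporary directory.
with tempfile.TemporaryDirectory() as out_dir:
    scrape_all(["zork1", "zork2"], out_dir, fetch=lambda n: f"<html>{n}</html>")
    remaining = pending(["zork1", "zork2", "zork3"], out_dir)
```

Re-running `scrape_all` with the full list of names only fetches the packages that are still missing.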

## Update

We can get the list of packages simply by going to `https://aur.archlinux.org/packages.gz`.

In [None]:
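packages = requests.get('https://aur.archlinux.org/packages.gz').text
# Keep the split result; drop the first line and any empty entries
packages = [name for name in packages.split('\n')[1:] if name]
packages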
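
Splitting the downloaded text on newlines can leave a header line and empty strings among the names. A small sketch of a more defensive parse follows. It assumes any header is a line starting with `#`, which is a guess about the file format, and `parse_package_list` is a hypothetical helper. The sample text is made up for illustration, using package names from the list in this notebook.

```python
def parse_package_list(text):
    """Return package names, dropping blank lines and '#' comment lines."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names

# Made-up sample in the assumed format: one header line, then one name per line.
sample = "# hypothetical header line\nw3watch\ncant\nphp-pear\n\n"
names = parse_package_list(sample)
```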

In [None]:
base_url = 'https://aur.archlinux.org/packages/'
downloaded = set(os.listdir())  # files already saved in the working directory
for package in packages:
    file_name = package + '.txt'
    if file_name not in downloaded:
        search_url = base_url + package + "/?comments=all"
        html = requests.get(search_url).text
        time.sleep(2)  # pause between requests
        with open(file_name, 'w+') as f:
            f.write(html)
    else:
        print(f'Skip: {package}')

In [None]:
# os.chdir('../html/pkgs/')
# package_html_files = os.listdir()
# base_url = 'https://aur.archlinux.org/packages/'
# for _, package in df.iterrows():
#     file_name = str(package['name'] + '.txt')
#     if file_name not in package_html_files: 
#         name = package['name']
#         link = package['name']
#         search_url = base_url+link + "/?comments=all"
#         html = requests.get(search_url).text
#         time.sleep(2)
#         f = open(file_name, 'w+')
#         f.write(str(html))
#         f.close()
#     else: 
#         print(f'Skip: {_}')


['w3watch',
 'cant',
 'php-pear',
 'jzip',
 'zork1',
 'zork2',
 'zork3',
 'guifications-clearlooks2glo',
 'atari-adventure',
 'eclipse-subclipse',
 'wallpaper-lightning',
 'squirrelmail',
 'atari-bowling',
 'atari-breakout',
 'atari-combat',
 'atari-space-invaders',
 'mrunit',
 'gtk-gnutella',
 'roundcubemail-plugin-markasjunk2',
 'roundcubemail-plugin-chbox',
 'roundcubemail-plugin-jquery-mobile',
 'roundcubemail-plugin-mobile',
 'eclipse-svnkit',
 'eclipse-dltk-core',
 'eclipse-emf',
 'eclipse-dltk-javascript',
 'eclipse-antlr-runtime',
 'eclipse-dltk-shelled',
 'eclipse-linuxtools',
 'eclipse-dltk-python',
 'eclipse-antlr4-runtime',
 'eclipse-jsonedit',
 'eclipse-goclipse',
 'adwaita-dark-darose',
 'roundcubemail-plugin-keyboard-shortcuts-ng',
 'fceux-svn',
 'ggmud-svn',
 'libiriverdb',
 'griver',
 'blastem-hg',
 'vecx-git',
 'lib32-glib',
 'lib32-gtk',
 'qjoypad',
 'qjoypad-svn',
 'qjoypad-panzi-git',
 'yumbootstrap-git',
 'nesasm-git',
 'yum-metadata-parser',
 'libretro-fmsx-git',