# Attack viz

### Table of Contents

 - [Set Up](#Setup)
 - [Uniqueness investigation & Cleaning](#cleaning)
 - [Mounting an Attack](#Mounting_an_Attack)
 - [Dependency discovery](#dependens_discovery)

## Set up <a id='Setup'></a>

In [257]:
import matplotlib
import pandas as pd
import pymysql
import random
import tqdm
%matplotlib inline

In [258]:
# Try to read the data from the fingerpatch database;
# if that does not work, fall back to the CSV files
try:
    connection = pymysql.connect(host='localhost',
                             user='fingerpatch',
                             password='fingerpatch',
                             db='fingerpatch',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
    attack_table = pd.read_sql("SELECT * FROM `ubuntu_captures` ",connection)
    ground_truth = pd.read_sql("SELECT * FROM `ubuntu_packets` ",connection)
    connection.close()

except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    
    print("No db found, loading from CSV Files")
    
    ground_truth = pd.read_csv("../crawl/ubuntu_packets.csv")
    attack_table = pd.read_csv("../capture/ubuntu_captures.csv")


ground_truth = ground_truth.set_index("id") 
attack_table = attack_table.set_index("capture_id")

No db found, loading from CSV Files


<a id='cleaning'></a>
## Uniqueness investigation & Cleaning 


Select interesting columns and remove duplicated rows

In [259]:
ground_truth.columns

Index(['capture_id', 'Package', 'Version', 'Architecture', 'Size',
       'Installed-Size', 'Priority', 'Maintainer', 'SHA1', 'Description',
       'parsedFrom', 'Description-md5', 'Bugs', 'Origin', 'MD5sum', 'Depends',
       'Homepage', 'Source', 'SHA256', 'Section', 'Supported', 'Filename',
       'packageMode'],
      dtype='object')

In [260]:
print("Total entries without having cleaned: ", len(ground_truth))

Total entries without having cleaned:  128148


In [261]:
# Deduplicate on the interesting fields only: the attacker has no means of distinguishing two packages that have the same size but a different packageMode
ground_truth = ground_truth.drop_duplicates(['Package', 'Version', 'Size', 'Depends', 'Architecture'])

# Make sure that there is no duplicate information (For a given Package name and Version we have at most one match)
print("The maximum duplication of rows that have the same Package name and Version is: ", ground_truth.groupby(by=["Package", "Version"]).count()["SHA1"].max())

# Keep only the interesting columns
ground_truth = ground_truth.drop(columns=['capture_id', 'SHA1', 'Priority', 'Description-md5', 'MD5sum', 'SHA256', 'packageMode'])

ground_truth = ground_truth.fillna("")
print("Total entries after having cleaned: ", len(ground_truth))

The maximum duplication of rows that have the same Package name and Version is:  1
Total entries after having cleaned:  56997


<a id='Mounting_an_Attack'></a>
## Mounting an Attack for matching a specific capture to a package.
##### Relying on package size

In [262]:
target = attack_table.iloc[0]
target

truth_id                                                       103746
nb_flows                                                            3
HTTP_Seq            [['GET /ubuntu/pool/universe/o/opennebula/libo...
Flow1                                   target->yukinko.canonical.com
Flow2                                   yukinko.canonical.com->target
Flow3                                   target->yukinko.canonical.com
Flow4                                                             NaN
Flow5                                                             NaN
nb_Payload_send1                                                    0
nb_Payload_send2                                                67874
nb_Payload_send3                                                  173
nb_Payload_send4                                                  NaN
nb_Payload_send5                                                  NaN
Name: 1, dtype: object

In [263]:
EXTRA_SIZE_AVERAGE = 283   # Made from stats about captured packets
EXTRA_SIZE_VARIATION = 5
size_to_match = target['nb_Payload_send2']

In [264]:
def distance_from_expected_average_size(x, size_to_match):
    return abs(size_to_match - x - EXTRA_SIZE_AVERAGE)

In [265]:
ground_truth["dist_from_expected_size"] = ground_truth["Size"].map(lambda x: distance_from_expected_average_size(x, size_to_match))

In [266]:
ground_truth.sort_values(by="dist_from_expected_size").head()

| id | Package | Version | Architecture | Size | Installed-Size | Maintainer | Description | parsedFrom | Bugs | Origin | Depends | Homepage | Source | Section | Supported | Filename | dist_from_expected_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39672 | libopennebula-java-doc | 3.4.1-4.1ubuntu1 | all | 67592 | 1194 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Java bindings for OpenNebula Cloud API (OCA) -... | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu |  | http://opennebula.org/ | opennebula | universe/doc |  | pool/universe/o/opennebula/libopennebula-java-... | 1 |
| 41519 | libshisa-dev | 1.0.2-3ubuntu2 | amd64 | 67594 | 385 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Development files for the Shishi Kerberos v5 K... | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libshisa0 (= 1.0.2-3ubuntu2), libshishi-dev (=... | http://www.gnu.org/software/shishi/ | shishi | universe/libdevel |  | pool/universe/s/shishi/libshisa-dev_1.0.2-3ubu... | 3 |
| 14154 | libcloog-isl-dev | 0.18.2-1 | amd64 | 67588 | 377 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Chunky Loop Generator (development files) | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libisl-dev (>= 0.11), libgmp-dev, libcloog-isl... | http://www.CLooG.org | cloog | libdevel | 9m | pool/main/c/cloog/libcloog-isl-dev_0.18.2-1_am... | 3 |
| 34922 | libghc-shakespeare-i18n-prof | 1.0.0.2-4build1 | amd64 | 67582 | 429 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | type-based approach to internationalization; p... | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libghc-shakespeare-i18n-dev (= 1.0.0.2-4build1... | http://hackage.haskell.org/package/shakespeare... | haskell-shakespeare-i18n | universe/haskell |  | pool/universe/h/haskell-shakespeare-i18n/libgh... | 9 |
| 26685 | gkrellmoon | 0.6-5 | amd64 | 67578 | 320 | Ubuntu MOTU Developers <ubuntu-motu@lists.ubun... | Gkrellm Moon Clock Plugin | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | gkrellm (>= 2.0.0), libatk1.0-0 (>= 1.13.2), l... |  |  | universe/x11 |  | pool/universe/g/gkrellmoon/gkrellmoon_0.6-5_am... | 13 |


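The first row is the right one: `libopennebula-java-doc` is consistent with the path requested in the capture's `HTTP_Seq` (`/ubuntu/pool/universe/o/opennebula/libo...`).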
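The ranking can also be turned into a shortlist by keeping only the candidates whose distance stays within `EXTRA_SIZE_VARIATION`. Below is a minimal sketch in plain Python, using the names and sizes of the five closest rows listed above; the helper `shortlist_by_size` is hypothetical and not part of the notebook.

```python
EXTRA_SIZE_AVERAGE = 283   # average per-package overhead seen by the attacker (bytes)
EXTRA_SIZE_VARIATION = 5   # tolerated deviation around that average (bytes)


def shortlist_by_size(packages, observed_payload):
    """Return [(distance, name)] for the packages that can explain the payload.

    `packages` is an iterable of (name, size) pairs. Hypothetical helper.
    """
    scored = []
    for name, size in packages:
        distance = abs(observed_payload - size - EXTRA_SIZE_AVERAGE)
        if distance <= EXTRA_SIZE_VARIATION:
            scored.append((distance, name))
    return sorted(scored)


# Names and sizes copied from the five closest rows of the table.
toy_packages = [
    ("libopennebula-java-doc", 67592),
    ("libshisa-dev", 67594),
    ("libcloog-isl-dev", 67588),
    ("libghc-shakespeare-i18n-prof", 67582),
    ("gkrellmoon", 67578),
]
shortlist = shortlist_by_size(toy_packages, observed_payload=67874)
print(shortlist)
```

With this tolerance three candidates remain, so size alone narrows the search but does not always single out one package.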

In [267]:
target

truth_id                                                       103746
nb_flows                                                            3
HTTP_Seq            [['GET /ubuntu/pool/universe/o/opennebula/libo...
Flow1                                   target->yukinko.canonical.com
Flow2                                   yukinko.canonical.com->target
Flow3                                   target->yukinko.canonical.com
Flow4                                                             NaN
Flow5                                                             NaN
nb_Payload_send1                                                    0
nb_Payload_send2                                                67874
nb_Payload_send3                                                  173
nb_Payload_send4                                                  NaN
nb_Payload_send5                                                  NaN
Name: 1, dtype: object

<a id='dependens_discovery'></a>
## Dependency discovery

Let's keep only the packages that have exactly one dependency, and sort them by ascending size.

In [268]:
ground_truth = ground_truth.fillna("")
ground_truth["#Depends"] = ground_truth["Depends"].map(lambda x: 0 if x == "" else len(x.split(",")))
one_dep_first10 = ground_truth[ground_truth["#Depends"] == 1].sort_values(by = "Size", ascending=True)[:10]

In [269]:
one_dep_first10.iloc[0]

Package                                                              readpst
Version                                                       0.6.59-1build1
Architecture                                                             all
Size                                                                     796
Installed-Size                                                            21
Maintainer                 Ubuntu Developers <ubuntu-devel-discuss@lists....
Description                    Converts Outlook PST files to mbox and others
parsedFrom                 packages/archive.ubuntu.com_ubuntu_dists_trust...
Bugs                              https://bugs.launchpad.net/ubuntu/+filebug
Origin                                                                Ubuntu
Depends                                                            pst-utils
Homepage                                  http://www.five-ten-sg.com/libpst/
Source                                                                libpst

Looking up that dependency:

In [270]:
ground_truth[ground_truth["Package"] == one_dep_first10.iloc[0]["Depends"]]

| id | Package | Version | Architecture | Size | Installed-Size | Maintainer | Description | parsedFrom | Bugs | Origin | Depends | Homepage | Source | Section | Supported | Filename | dist_from_expected_size | #Depends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18280 | pst-utils | 0.6.59-1build1 | amd64 | 62092 | 181 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | tools for reading Microsoft Outlook PST files | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libc6 (>= 2.14), libgcc1 (>= 1:4.1.1), libgd3 ... | http://www.five-ten-sg.com/libpst/ | libpst | utils | 9m | pool/main/libp/libpst/pst-utils_0.6.59-1build1... | 5499 | 7 |


It turns out that this dependency has dependencies of its own:

In [271]:
sub_dependances = ground_truth[ground_truth["Package"] == one_dep_first10.iloc[0]["Depends"]].iloc[0]["Depends"]
print(sub_dependances)

libc6 (>= 2.14), libgcc1 (>= 1:4.1.1), libgd3 (>= 2.1.0~alpha~), libglib2.0-0 (>= 2.12.0), libgsf-1-114 (>= 1.14.8), libpst4 (>= 0.6.54), libstdc++6 (>= 4.6)


#### When downloading the package `readpst`, we can indeed see that it does not pull in just one package, but also the sub-dependencies of that dependency:

On the victim (Docker container):

`The following extra packages will be installed:
  fontconfig-config fonts-dejavu-core libfontconfig1 libfreetype6 libgd3
  libglib2.0-0 libglib2.0-data libgsf-1-114 libgsf-1-common libjbig0
  libjpeg-turbo8 libjpeg8 libpst4 libtiff5 libvpx1 libx11-6 libx11-data
  libxau6 libxcb1 libxdmcp6 libxml2 libxpm4 pst-utils sgml-base
  shared-mime-info xml-core`
  
`0 upgraded, 27 newly installed, 0 to remove and 32 not upgraded.
Need to get 5664 kB of archives.`


On the attacker:

`historic =  ['target->danava.canonical.com', 'danava.canonical.com->target', 'target->danava.canonical.com']
server_ip =  ['91.189.88.149', '172.100.0.100', '91.189.88.149']
server_name =  ['danava.canonical.com', 'target', 'danava.canonical.com']
received_Payload =  [5671834]
send_Payload =  [0, 4251]`

If we add the typical per-package overhead observed on the attacker side (`EXTRA_SIZE_AVERAGE`) for each of the 27 downloaded packages, keeping in mind that 5664 kB is a rounded figure, we get:


In [272]:
EXTRA_SIZE_AVERAGE * 27 + 5664000

5671641
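Since both the archive size printed by apt and the per-package overhead are approximate, it can be more telling to compute a range than a single value. The sketch below assumes that apt reports kB as multiples of 1000 bytes rounded to the nearest unit, and that every package adds `EXTRA_SIZE_AVERAGE` plus or minus `EXTRA_SIZE_VARIATION` bytes; the function name `expected_payload_range` is hypothetical.

```python
EXTRA_SIZE_AVERAGE = 283   # average per-package overhead (bytes)
EXTRA_SIZE_VARIATION = 5   # tolerated deviation per package (bytes)


def expected_payload_range(n_packages, reported_kb):
    """Return (low, high) bounds, in bytes, for the payload seen by the attacker.

    Assumption: the true archive size lies within +/- 500 bytes of the
    rounded figure printed by apt (1 kB = 1000 bytes).
    """
    archive_low = reported_kb * 1000 - 500
    archive_high = reported_kb * 1000 + 500
    overhead_low = n_packages * (EXTRA_SIZE_AVERAGE - EXTRA_SIZE_VARIATION)
    overhead_high = n_packages * (EXTRA_SIZE_AVERAGE + EXTRA_SIZE_VARIATION)
    return archive_low + overhead_low, archive_high + overhead_high


low, high = expected_payload_range(n_packages=27, reported_kb=5664)
observed = 5671834  # received_Payload reported on the attacker side
print(low, high, low <= observed <= high)
```

Under these assumptions the payload observed by the attacker falls inside the computed range.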

#### Let's find out what happens if we download the dependency first

While downloading pst-utils (*using apt-get install pst-utils*):

On the victim:

`The following extra packages will be installed:
  fontconfig-config fonts-dejavu-core libfontconfig1 libfreetype6 libgd3
  libglib2.0-0 libglib2.0-data libgsf-1-114 libgsf-1-common libjbig0
  libjpeg-turbo8 libjpeg8 libpst4 libtiff5 libvpx1 libx11-6 libx11-data
  libxau6 libxcb1 libxdmcp6 libxml2 libxpm4 sgml-base shared-mime-info
  xml-core
0 upgraded, 26 newly installed, 0 to remove and 32 not upgraded.
Need to get 5663 kB of archives.`


On the Attacker:

`historic =  ['target->keeton.canonical.com', 'keeton.canonical.com->target', 'target->keeton.canonical.com']
server_ip =  ['91.189.88.161', '172.100.0.100', '91.189.88.161']
server_name =  ['keeton.canonical.com', 'target', 'keeton.canonical.com']
received_Payload =  [5670760]
send_Payload =  [0, 4085]
Ressources cleaned.`


In [273]:
print(" -- Seen by the attacker -- Difference between downloading the full package and only its dependencies:", 5671834 - 5670760)
print(" -- From the ground_truth -- Difference between downloading the full package and only its dependencies:", 796 + EXTRA_SIZE_AVERAGE)

 -- Seen by the attacker -- Difference between downloading the full package and only its dependencies: 1074
 -- From the ground_truth -- Difference between downloading the full package and only its dependencies: 1079


Now that the dependency is installed on the victim's machine, we download the main package:


On the victim:

`The following NEW packages will be installed:
  readpst
0 upgraded, 1 newly installed, 0 to remove and 32 not upgraded.
Need to get 796 B of archives.
After this operation, 21.5 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu/ trusty/universe readpst all 0.6.59-1build1 [796 B]
Fetched 796 B in 0s (3656 B/s)   
Download complete and in download only mode`

On the attacker:

`historic =  ['target->steelix.canonical.com', 'steelix.canonical.com->target', 'target->steelix.canonical.com']
server_ip =  ['91.189.88.152', '172.100.0.100', '91.189.88.152']
server_name =  ['steelix.canonical.com', 'target', 'steelix.canonical.com']
received_Payload =  [1074]
send_Payload =  [0, 155]`


Indeed, once the dependency is installed, installing just the package transfers only its 796 bytes plus the per-package overhead:

In [274]:
1074 - 796

278

### Sum of dependencies & number of dependencies

In [275]:
one_dep_first10.iloc[1]

Package                                                                 gcom
Version                                                               0.32-2
Architecture                                                             all
Size                                                                     820
Installed-Size                                                            20
Maintainer                 Ubuntu Developers <ubuntu-devel-discuss@lists....
Description                     datacard control tool - transitional package
parsedFrom                 packages/archive.ubuntu.com_ubuntu_dists_trust...
Bugs                              https://bugs.launchpad.net/ubuntu/+filebug
Origin                                                                Ubuntu
Depends                                                                comgt
Homepage                                           http://www.pharscape.org/
Source                                                                 comgt

In [276]:
ground_truth[ground_truth["Package"] == "comgt"]

| id | Package | Version | Architecture | Size | Installed-Size | Maintainer | Description | parsedFrom | Bugs | Origin | Depends | Homepage | Source | Section | Supported | Filename | dist_from_expected_size | #Depends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23140 | comgt | 0.32-2 | amd64 | 42804 | 188 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Option GlobeTrotter and Vodafone datacard cont... | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libc6 (>= 2.7) | http://www.pharscape.org/ |  | universe/net |  | pool/universe/c/comgt/comgt_0.32-2_amd64.deb | 24787 | 1 |


In [277]:
ground_truth[ground_truth["Package"] == "libc6"]

| id | Package | Version | Architecture | Size | Installed-Size | Maintainer | Description | parsedFrom | Bugs | Origin | Depends | Homepage | Source | Section | Supported | Filename | dist_from_expected_size | #Depends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1784 | libc6 | 2.19-0ubuntu6.14 | amd64 | 4752538 | 10508 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Embedded GNU C Library: Shared libraries | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libgcc1 | http://www.eglibc.org | eglibc | libs | 5y | pool/main/e/eglibc/libc6_2.19-0ubuntu6.14_amd6... | 4684947 | 1 |
| 14033 | libc6 | 2.19-0ubuntu6 | amd64 | 4729214 | 10496 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | Embedded GNU C Library: Shared libraries | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | libgcc1 | http://www.eglibc.org | eglibc | libs | 5y | pool/main/e/eglibc/libc6_2.19-0ubuntu6_amd64.deb | 4661623 | 1 |


Dependencies can be ambiguous:

In [278]:
print(ground_truth["Depends"].iloc[48])
print(ground_truth["Depends"].iloc[405])

python (>= 2.7), python (<< 2.8), python:any (>= 2.7.1-0ubuntu2), base-files (>= 4.0.4)
python-bzrlib (<= 2.6.0+bzr6593-1ubuntu1.6.1~), python-bzrlib (>= 2.6.0+bzr6593-1ubuntu1.6), python:any
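To make this ambiguity explicit, a `Depends` string can be split into structured clauses before any lookup. The following is a rough sketch only: the function `parse_depends` and its regular expression are our own assumptions, they cover the forms seen in the two strings above (name, optional `:qualifier`, optional `(operator version)`, `|` alternatives) and they deliberately reject anything else, such as architecture restrictions in square brackets.

```python
import re

# One clause looks like "name", "name:qualifier" or "name (op version)".
_CLAUSE = re.compile(
    r"^(?P<name>[^\s:(]+)"            # package name
    r"(?::(?P<qualifier>[^\s(]+))?"   # optional qualifier, e.g. ":any"
    r"(?:\s*\((?P<op><<|<=|=|>=|>>)\s*(?P<version>[^)\s]+)\))?$"
)


def parse_depends(dep_string):
    """Parse a Depends field into a list of groups of alternatives.

    Each group is a list of (name, qualifier, operator, version) tuples;
    missing parts are None. Hypothetical helper, not the notebook's parser.
    """
    groups = []
    for clause in dep_string.split(","):
        alternatives = []
        for alt in clause.split("|"):
            alt = alt.strip()
            if not alt:
                continue
            match = _CLAUSE.match(alt)
            if match is None:
                raise ValueError(f"Unrecognised dependency clause: {alt!r}")
            alternatives.append(
                (match["name"], match["qualifier"], match["op"], match["version"])
            )
        if alternatives:
            groups.append(alternatives)
    return groups


example = "python (>= 2.7), python (<< 2.8), python:any (>= 2.7.1-0ubuntu2), base-files (>= 4.0.4)"
parsed = parse_depends(example)
for group in parsed:
    print(group)
```

Seen this way, the three `python` clauses are separate constraints on the same package that all have to hold, rather than three different dependencies.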


### Before implementing the recursive function, we noted the following issues:

 - Walking through dependencies can lead to cycles (e.g. comgt->libc6->libgcc1->libc6)
   => Can be fixed by keeping a list of the dependencies already seen

 - Several versions of the same package can occur (e.g. libc6 2.19-0ubuntu6.14 and 2.19-0ubuntu6, which do not have the same size)
   => Maybe take the most recent one (to save time, keep only the newest version beforehand)

 - Some packages are already installed by default (like libc6 on our victim's machine)

 - Dependencies can be ambiguous (e.g. python (>= 2.7), python (<< 2.8), python:any (>= 2.7.1-0ubuntu2))
   => The dependencies have to be parsed carefully
 
 Some references:
 
 [Depends field format](https://www.debian.org/doc/debian-policy/ch-relationships.html)
 
 [Version field format](http://www.fifi.org/doc/debian-policy/policy.html/ch-versions.html)
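The first and third issues can be illustrated on a toy graph. The sketch below is not the function used in this notebook: it is an iterative traversal, under our own naming (`transitive_download_size`, `already_installed`), that counts every package at most once and can be seeded with the packages already present on the victim. The edges come from the cycle example above; the sizes are placeholders, not real archive sizes.

```python
def transitive_download_size(root, depends_on, size_of, already_installed=()):
    """Sum the sizes of `root` and of everything reachable from it.

    Every package is counted once, so cycles terminate; packages listed in
    `already_installed` are treated as visited and contribute nothing.
    """
    seen = set(already_installed)
    stack = [root]
    total = 0
    downloaded = []
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        total += size_of[package]
        downloaded.append(package)
        stack.extend(depends_on.get(package, []))
    return total, downloaded


# Edges from the cycle example; the sizes are made-up placeholders.
toy_depends = {"comgt": ["libc6"], "libc6": ["libgcc1"], "libgcc1": ["libc6"]}
toy_sizes = {"comgt": 10, "libc6": 100, "libgcc1": 1}

total, downloaded = transitive_download_size("comgt", toy_depends, toy_sizes)
print(total, downloaded)

total_preinstalled, _ = transitive_download_size(
    "comgt", toy_depends, toy_sizes, already_installed={"libc6"}
)
print(total_preinstalled)
```

Counting each package once matters because a sum that adds a shared dependency every time it is reached can grow far beyond what would actually be downloaded.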

In [396]:

def recursiveSearchOnDep(x, summing, df,alreadySeen):
    """
    Recursively sum the size of package x and of its dependencies.

    x : the current row (pandas Series); it must contain Package, Version,
        Depends, Size, "Summing dependances" and "Dependance traces"
        (the last two cache the results for the dynamic approach)
    summing : the running sum of the sizes, in bytes
    df : the DataFrame on which the recursive search is performed
    alreadySeen : dict of the packages (+ version) already visited

    Returns the tuple (summing, alreadySeen).
    """
    
    xKey = x["Package"] + " : "+x["Version"]
    if xKey in alreadySeen:
        return (summing, alreadySeen)
    
    alreadySeen[xKey] = []  
    
    deps = parseAndFindDep(x["Depends"], df)
    
    if len(deps) == 0: # Touches the leaves
        
        # Fill the df 
        df.at[x.name, "Summing dependances"] = x["Size"]
        df.at[x.name,"Dependance traces"] = alreadySeen
    
        return ( x["Size"], alreadySeen)
    
    
    for dep in deps:
        
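        # NOTE: dep is a row id whereas alreadySeen is keyed by "Package : Version",
        # so this test always passes; cycles are caught by the xKey check at the top.
        if dep not in alreadySeen: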
            newX = df.loc[dep]

            
            s, as_ = newX["Summing dependances"], newX["Dependance traces"]
            
            if s == -1:

                # Meaning we never saw it before
                s, as_ = recursiveSearchOnDep(newX, 0, df, alreadySeen)
            
                
            # Merging
            summing += s
            alreadySeen = {**as_ , **alreadySeen}
            alreadySeen[xKey] += [newX["Package"] + " : "+newX["Version"]]
    
    
    summing = summing + x["Size"]
    
    df.at[x.name, "Summing dependances"] = summing
    df.at[x.name,"Dependance traces"] = alreadySeen
    
    return (summing, alreadySeen)
    
    

In [280]:


def parseAndFindDep(depString, df):
    """
    Return the list of ids (index values of df) of the packages named in depString.
    For alternatives ("a | b"), only the first one found in df is kept.
    """
    ids = list()
    
    allPckg = df["Package"].unique()

    for d in depString.split(", "):


        for d2 in d.split(" | "):

            d2 = d2.split(" (")

            package = d2[0]

            #print(package)

            version = ""
            if len(d2) == 2:
                # We have more info about the version
                (req, version) = d2[1][:-1].split(" ")

                if req == "<<" :  # "<<" means strictly earlier
                    req = "<"
                if req == ">>":
                    req = ">"
                if req == "=":
                    req = "=="


            if package in allPckg:

                # TOFIX: plain string comparison misorders versions ("2.9.3" sorts above "2.12.4" although 2.12.4 is newer)
                package_candidates = df[df["Package"] == package].sort_values(by="Version", ascending=False)
                id_ = package_candidates.iloc[0].name

                if version != "":
                    # Restraint further more using the version spec.
                    package_candidates = package_candidates.query("Version "+req+" '"+version+"'")    

                    if len(package_candidates) > 0 :
                        # just take the most recent one if there are many versions
                        id_ = package_candidates.iloc[0].name


                # Add it only if it's the first time we add it
                if id_ not in ids:
                    ids = ids + [id_]

                # We found it no need to take the packages after "|"
                break 
                
    return ids
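The `TOFIX` note in `parseAndFindDep` concerns version ordering: sorting and querying on the raw `Version` string compares character by character. Below is a sketch of a comparator that follows our reading of the Debian policy rules (epoch, then upstream version, then revision, with `~` sorting before everything else). The names `compare_versions`, `_compare_fragment`, `_order` and `_split` are ours. Where python-apt is installed, its `apt_pkg.version_compare` is probably the safer choice; it is not used here so that the example has no dependencies.

```python
import re
from functools import cmp_to_key


def _order(char):
    """Sort weight of one character: '~' first, then letters, then the rest."""
    if char == "~":
        return -1
    if char.isalpha():
        return ord(char)
    return ord(char) + 256


def _compare_fragment(a, b):
    """Compare two upstream-version or revision strings; return -1, 0 or 1."""
    while a or b:
        # 1) Leading non-digit runs, compared character by character.
        non_a, non_b = re.match(r"\D*", a).group(), re.match(r"\D*", b).group()
        a, b = a[len(non_a):], b[len(non_b):]
        for i in range(max(len(non_a), len(non_b))):
            wa = _order(non_a[i]) if i < len(non_a) else 0
            wb = _order(non_b[i]) if i < len(non_b) else 0
            if wa != wb:
                return -1 if wa < wb else 1
        # 2) Leading digit runs, compared as integers (an empty run counts as 0).
        num_a, num_b = re.match(r"\d*", a).group(), re.match(r"\d*", b).group()
        a, b = a[len(num_a):], b[len(num_b):]
        va, vb = int(num_a or 0), int(num_b or 0)
        if va != vb:
            return -1 if va < vb else 1
    return 0


def _split(version):
    """Split "epoch:upstream-revision" into (epoch, upstream, revision)."""
    epoch, rest = version.split(":", 1) if ":" in version else ("0", version)
    upstream, sep, revision = rest.rpartition("-")
    if not sep:  # no revision part at all
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return -1, 0 or 1 depending on whether a is older than, equal to or newer than b."""
    epoch_a, up_a, rev_a = _split(a)
    epoch_b, up_b, rev_b = _split(b)
    if epoch_a != epoch_b:
        return -1 if epoch_a < epoch_b else 1
    return _compare_fragment(up_a, up_b) or _compare_fragment(rev_a, rev_b)


ordered = sorted(["2.9.3", "2.12.4", "2.10"], key=cmp_to_key(compare_versions))
print(ordered)
```

With such a comparator the candidates could be sorted with `cmp_to_key(compare_versions)` and the version constraint checked in Python rather than through a string `query`.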
        
        

In [281]:
ground_truth["Summing dependances"] = -1
ground_truth["Dependance traces"] = "{}"
 
ground_truth = ground_truth.sort_values(["#Depends"])

for _, row in tqdm.tqdm(ground_truth.iterrows(), total=len(ground_truth)):
       
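    # Skip the recursion if the value was already computed. iterrows() may yield
    # copies of the rows, so we check the live DataFrame rather than row.
    if ground_truth.at[row.name, "Summing dependances"] == -1:
        _, _ = recursiveSearchOnDep(row,  0, ground_truth, {}) 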
    
    #ground_truth.at[row.name, "Summing dependances"] = summing
    #ground_truth.at[row.name,"Dependance traces"] = alreadySeen




#%time test["Summing dependances"], test["Dependance traces"] = zip(*test.apply(lambda x: recursiveSearchOnDep(x, 0, test, {}), axis = 1))


100%|██████████| 56997/56997 [50:14<00:00,  1.02it/s]   


Taking one random package:

In [285]:
# randrange excludes the upper bound, so r is always a valid position for iloc
r = random.randrange(len(ground_truth))
print(r)
ground_truth.iloc[r]

18205


Package                                                         weather-util
Version                                                                2.0-1
Architecture                                                             all
Size                                                                   27922
Installed-Size                                                           153
Maintainer                 Ubuntu Developers <ubuntu-devel-discuss@lists....
Description                command-line tool to obtain weather conditions...
parsedFrom                 packages/archive.ubuntu.com_ubuntu_dists_trust...
Bugs                              https://bugs.launchpad.net/ubuntu/+filebug
Origin                                                                Ubuntu
Depends                                                 python (>= 2.6.6-3~)
Homepage                                   http://fungi.yuggoth.org/weather/
Source                                                                      

In [292]:
print("There are:", len(ground_truth.iloc[r]["Dependance traces"]), "sub-dependencies for the package " + ground_truth.iloc[r]["Package"])

There are: 62 sub-dependencies for the package weather-util


In [387]:
ground_truth[ground_truth["Package"] == "python"]

| id | Package | Version | Architecture | Size | Installed-Size | Maintainer | Description | parsedFrom | Bugs | Origin | Depends | Homepage | Source | Section | Supported | Filename | dist_from_expected_size | #Depends | Summing dependances | Dependance traces |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18315 | python | 2.7.5-5ubuntu3 | amd64 | 133698 | 671 | Ubuntu Developers <ubuntu-devel-discuss@lists.... | interactive high-level object-oriented languag... | packages/archive.ubuntu.com_ubuntu_dists_trust... | https://bugs.launchpad.net/ubuntu/+filebug | Ubuntu | python2.7 (>= 2.7.5-1~), python-minimal (= 2.7... | http://www.python.org/ | python-defaults | python | 5y | pool/main/p/python-defaults/python_2.7.5-5ubun... | 66107 | 3 | 143918714 | {'gcc-4.9-base : 4.9.3-0ubuntu4': [], 'perl-ba... |


In [297]:
print("All the dependencies of the package python: ")
ground_truth.loc[18315]["Dependance traces"]

All the dependencies of the package python: 


{'gcc-4.9-base : 4.9.3-0ubuntu4': [],
 'perl-base : 5.18.2-2ubuntu1.6': [],
 'libconfig-tiny-perl : 2.20-1': ['perl : 5.18.2-2ubuntu1.6'],
 'perl : 5.18.2-2ubuntu1.6': ['perl-base : 5.18.2-2ubuntu1.6',
  'perl-modules : 5.18.2-2ubuntu1.6',
  'libbz2-1.0 : 1.0.6-5',
  'libc6 : 2.19-0ubuntu6.14',
  'libdb5.3 : 5.3.28-3ubuntu3.1',
  'libgdbm3 : 1.8.3-12build1',
  'zlib1g : 1:1.2.8.dfsg-1ubuntu1.1'],
 'perl-modules : 5.18.2-2ubuntu1.6': ['perl : 5.18.2-2ubuntu1.6'],
 'libbz2-1.0 : 1.0.6-5': ['libc6 : 2.19-0ubuntu6.14'],
 'libc6 : 2.19-0ubuntu6.14': ['libgcc1 : 1:4.9.3-0ubuntu4'],
 'libgcc1 : 1:4.9.3-0ubuntu4': ['gcc-4.9-base : 4.9.3-0ubuntu4',
  'libc6 : 2.19-0ubuntu6.14'],
 'dpkg : 1.17.5ubuntu5.8': [],
 'libdb5.3 : 5.3.28-3ubuntu3.1': ['libc6 : 2.19-0ubuntu6.14'],
 'libgdbm3 : 1.8.3-12build1': ['libc6 : 2.19-0ubuntu6.14',
  'dpkg : 1.17.5ubuntu5.8'],
 'debconf : 1.5.51ubuntu2': [],
 'ruby-addressable : 2.3.4-1': ['ruby : 1:1.9.3.4'],
 'ruby : 1:1.9.3.4': ['ruby1.9.1 : 1.9.3.484-2ubuntu1.

### Building the dependency trees as a dictionary

In [None]:
All_traces = dict()
for index, row in ground_truth.iterrows():
    # update() merges in place instead of rebuilding the whole dict for every row
    All_traces.update(row['Dependance traces'])
    
All_traces

In [393]:
All_traces["perl : 5.18.2-2ubuntu1.6"]

['perl-base : 5.18.2-2ubuntu1.6',
 'perl-modules : 5.18.2-2ubuntu1.6',
 'libbz2-1.0 : 1.0.6-5',
 'libc6 : 2.19-0ubuntu6.14',
 'libdb5.3 : 5.3.28-3ubuntu3.1',
 'libgdbm3 : 1.8.3-12build1',
 'zlib1g : 1:1.2.8.dfsg-1ubuntu1.1',
 'perl-modules : 5.18.2-2ubuntu1.6',
 'libbz2-1.0 : 1.0.6-5',
 'libc6 : 2.19-0ubuntu6.14',
 'libdb5.3 : 5.3.28-3ubuntu3.1',
 'libgdbm3 : 1.8.3-12build1',
 'zlib1g : 1:1.2.8.dfsg-1ubuntu1.1']

In [394]:
All_traces["zlib1g : 1:1.2.8.dfsg-1ubuntu1.1"]

['libc6 : 2.19-0ubuntu6.14']

In [395]:
All_traces["libc6 : 2.19-0ubuntu6.14"]

['libgcc1 : 1:4.9.3-0ubuntu4', 'libgcc1 : 1:4.9.3-0ubuntu4']
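The traces printed above contain repeated entries (for instance `libgcc1` appears twice under `libc6`). If only the set of direct dependencies matters, the lists can be deduplicated while keeping their order. This is a small sketch; the helper names are ours and the toy input reproduces the `libc6` entry shown above.

```python
def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


def dedupe_traces(traces):
    """Apply the deduplication to every dependency list of a traces dict."""
    return {package: dedupe_preserving_order(deps) for package, deps in traces.items()}


toy_traces = {
    "libc6 : 2.19-0ubuntu6.14": ["libgcc1 : 1:4.9.3-0ubuntu4", "libgcc1 : 1:4.9.3-0ubuntu4"],
    "zlib1g : 1:1.2.8.dfsg-1ubuntu1.1": ["libc6 : 2.19-0ubuntu6.14"],
}
clean_traces = dedupe_traces(toy_traces)
print(clean_traces)
```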

Let's group our list by size and by sum of dependencies:

In [385]:
bySize = ground_truth.groupby(by="Size").count().groupby(by = "Package").count()
total = bySize["Version"].sum()
print(total)
bySize["Version"].map(lambda x : x/total).sort_values(ascending = False).head()

43710


Package
1    0.830611
2    0.110867
3    0.035941
4    0.012560
5    0.004850
Name: Version, dtype: float64

In [383]:
includingDep = ground_truth.groupby(by="Summing dependances").count().groupby(by = "Package").count()
total = includingDep["Version"].sum()
includingDep["Version"].map(lambda x : x/total)

Package
1    0.990101
2    0.009065
3    0.000568
4    0.000160
5    0.000071
6    0.000018
7    0.000018
Name: Version, dtype: float64

In [384]:
# Series.append was removed in pandas 2.0; pd.concat does the same job
both = pd.concat([ground_truth["Size"], ground_truth["Summing dependances"]])
grouped = both.groupby(both).count()
uniqueness = grouped.groupby(grouped).count().sort_values(ascending = False)
total = uniqueness.sum()
uniqueness.map(lambda x : x/total).head()

1    0.862716
2    0.100886
3    0.021990
4    0.008180
5    0.003178
dtype: float64
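One point worth keeping in mind when reading these ratios: as far as we can tell, the double `groupby` counts distinct values (how many sizes are shared by exactly k packages), not packages. The sketch below, in plain Python with made-up toy sizes and hypothetical function names, shows both quantities side by side.

```python
from collections import Counter


def collision_profile(values):
    """Map k to the fraction of distinct values shared by exactly k packages."""
    packages_per_value = Counter(values)
    values_per_count = Counter(packages_per_value.values())
    total = sum(values_per_count.values())
    return {k: n / total for k, n in sorted(values_per_count.items())}


def identifiable_fraction(values):
    """Fraction of packages whose value is not shared with any other package."""
    packages_per_value = Counter(values)
    unique = sum(1 for n in packages_per_value.values() if n == 1)
    return unique / len(values)


# Toy sizes (made up for the illustration): two packages share the size 820.
toy_sizes = [796, 820, 820, 1074, 67592]
profile = collision_profile(toy_sizes)
fraction = identifiable_fraction(toy_sizes)
print(profile)
print(fraction)
```

On the toy list, 75% of the distinct sizes are unique but only 60% of the packages can be identified from their size alone, so the second number is the more conservative one from the attacker's point of view.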

Remarks & Questions:

 - We focus only on `apt-get install`, not on `apt-get upgrade` (software updates)

 - Everything done here applies to one specific Ubuntu release only

 - When is the midterm presentation, and how long should it last?
    
