# Microsoft KBs Analysis

## What KB data sources do we have?

```{mermaid}

graph TD


    A[winbindex fa:fa-database] --> B[KBs]
    C[ms support feeds fa:fa-database] -->B
    B --> D[List of updated Binaries fa:fa-file]
    B --> E[Release dates fa:fa-calendar]
    B --> F[Build Versions]
```


### winbindex

Winbindex builds its list of Windows OS builds by [scraping the Windows update pages](https://github.com/m417z/winbindex/blob/gh-pages/data/upd01_get_list_of_updates.py) for Windows 10 and Windows 11 updates:
- https://support.microsoft.com/en-us/help/4000823
- https://support.microsoft.com/en-us/help/5006099

It has several more steps in its [workflow](https://github.com/m417z/winbindex/tree/gh-pages/data#winbindex-flow-of-scripts).

The `cvedata` code that parses the winbindex data is in [winbindex.py](https://github.com/clearbluejar/cvedata/blob/main/cvedata/winbindex.py).
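
As a rough illustration of the scraping step, the sketch below pulls KB identifiers and OS build numbers out of update-page link text with regular expressions. The sample titles and the `parse_update_title` helper are assumptions made for illustration; this is not the code winbindex or `cvedata` uses.

```python
import re

# Illustrative link texts in the style of a Windows update history page (assumed format).
SAMPLE_TITLES = [
    "April 25, 2022 - KB5012643 (OS Build 22000.652) Preview",
    "April 12, 2022 - KB5012599 (OS Builds 19042.1645, 19043.1645, and 19044.1645)",
    "Windows 11 update history",
]

KB_RE = re.compile(r"\bKB\d+\b")         # e.g. KB5012643
BUILD_RE = re.compile(r"\b\d{5}\.\d+\b")  # e.g. 22000.652


def parse_update_title(title):
    """Return (kb, builds) for a title, or None when it contains no KB number."""
    kb_match = KB_RE.search(title)
    if kb_match is None:
        return None
    return kb_match.group(0), BUILD_RE.findall(title)


# Keep only the titles that actually name a KB.
parsed = [p for p in map(parse_update_title, SAMPLE_TITLES) if p is not None]
for kb, builds in parsed:
    print(kb, builds)
```

Titles without a KB number, such as the page heading, are skipped rather than treated as errors.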

### MS Atom RSS Feeds KBs

The MS feeds KB data relies on the feeds listed at https://support.microsoft.com/en-us/rss-feed-picker.

The `cvedata` code is in [ms_feed_kbs.py](https://github.com/clearbluejar/cvedata/blob/main/cvedata/ms_feed_kbs.py).

It pulls data from the following feeds:

- WIN10_FEED_URL = "https://support.microsoft.com/en-us/feed/atom/6ae59d69-36fc-8e4d-23dd-631d98bf74a9"
- WIN11_FEED_URL = "https://support.microsoft.com/en-us/feed/atom/4ec863cc-2ecd-e187-6cb3-b50c6545db92"
- WIN_SERVER_2022_FEED_URL = "https://support.microsoft.com/en-us/feed/atom/2d67e9fb-2bd2-6742-08ee-628da707657f"
- WIN_SERVER_2019_FEED_URL = "https://support.microsoft.com/en-us/feed/atom/eb958e25-cff9-2d06-53ca-f656481bb31f"
- WIN_SERVER_2016_FEED_URL = "https://support.microsoft.com/en-us/feed/atom/c3a1be8a-50db-47b7-d5eb-259debc3abcc"
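
To give a feel for what reading such a feed involves, here is a minimal sketch that parses an Atom document with the standard library and keeps the entries whose title names a KB. The inline `SAMPLE_FEED` document and the `kb_entries` helper are made up for illustration (no network access is needed), and this is not necessarily how `ms_feed_kbs.py` works.

```python
import re
import xml.etree.ElementTree as ET

# The Atom namespace; every element in an Atom feed lives in it.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up Atom document standing in for a downloaded feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample update feed</title>
  <entry>
    <title>April 25, 2022 - KB5012643 (OS Build 22000.652) Preview</title>
    <link href="https://example.com/kb5012643" />
    <updated>2022-04-25T17:00:00Z</updated>
  </entry>
  <entry>
    <title>Release notes overview</title>
    <link href="https://example.com/overview" />
    <updated>2022-04-01T17:00:00Z</updated>
  </entry>
</feed>
"""


def kb_entries(feed_xml):
    """Return one dict per feed entry whose title contains a KB number."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        match = re.search(r"\bKB\d+\b", title)
        if match:
            updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
            entries.append({"kb": match.group(0), "title": title, "updated": updated})
    return entries


entries = kb_entries(SAMPLE_FEED)
print(entries)
```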

## Import cvedata

In [None]:
from cvedata.ms_feed_kbs import get_ms_kb_to_bins_json
from cvedata.winbindex import get_winbindex_kbs_to_bin_map
from cvedata.msrc_pandas import get_msrc_cvrf_pandas_df
import pandas as pd

wb_kbs = get_winbindex_kbs_to_bin_map()
ms_kbs = get_ms_kb_to_bins_json()
msrc_pandas = get_msrc_cvrf_pandas_df()

### Winbindex KB Data

In [None]:
# One row per KB; 'updated' holds the list of files that KB updated.
wb_kbs_df = pd.DataFrame.from_dict(wb_kbs, orient='index')
wb_kbs_df['bin count'] = wb_kbs_df['updated'].apply(len)
# Drop KBs that list no updated files.
without_bins = wb_kbs_df[wb_kbs_df['bin count'] == 0].index
wb_kbs_df = wb_kbs_df.drop(without_bins)
wb_kbs_df.sort_values(by=['bin count'], ascending=False)


### MS feeds KB Data

In [None]:
ms_kbs_df = pd.DataFrame.from_dict(ms_kbs).sort_values(by=['bin count'], ascending=False)
ms_kbs_df

### How many unique KBs do we have information for?

In [None]:

wb_kbs_df.index.union(ms_kbs_df.index).shape[0]
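
The count above is the size of the union of the two KB indexes. A small sketch with toy indexes (the KB numbers below are placeholders, not real data) shows how the same set operations can also split the total into shared and source-specific KBs:

```python
import pandas as pd

# Toy indexes standing in for wb_kbs_df.index and ms_kbs_df.index.
wb_index = pd.Index(["KB5000001", "KB5000002", "KB5000003"])
ms_index = pd.Index(["KB5000002", "KB5000003", "KB5000004"])

all_kbs = wb_index.union(ms_index)        # KBs known to either source
shared = wb_index.intersection(ms_index)  # KBs known to both sources
only_wb = wb_index.difference(ms_index)   # KBs only the first source covers
only_ms = ms_index.difference(wb_index)   # KBs only the second source covers

print(len(all_kbs), len(shared), len(only_wb), len(only_ms))
```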


### Why does Winbindex report more updated files?

In [None]:
# Compare the file extensions each source lists for a single KB.
kb = 'KB5012643'
file_ext_df = pd.DataFrame({'wb' : pd.Series(wb_kbs_df.loc[kb]['updated']), 'mskb' : pd.Series(ms_kbs_df.loc[kb]['updated'])})
# Keep the text after the last dot; the NaN padding of the shorter column stays NaN
# instead of turning into a bogus 'nan' extension.
file_ext_df = file_ext_df.apply(lambda col: col.str.rsplit('.', n=1).str[-1])
file_ext_df

In [None]:
wb_ext = set(file_ext_df['wb'].dropna().str.lower().unique())
wb_ext

In [None]:
mskb_ext = set(file_ext_df['mskb'].dropna().str.lower().unique())
mskb_ext

In [None]:
wb_ext.difference(mskb_ext)

In [None]:
mskb_ext.difference(wb_ext)

For this KB, the Winbindex updated-files data covers quite a few more file types than the MS feeds data.
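
The same comparison can be sketched with the standard library alone, which also shows how many files of each extra type one source lists. The file lists below are invented for illustration and are not taken from either data source.

```python
from collections import Counter
from pathlib import PurePath

# Invented file lists standing in for the two sources' 'updated' lists of one KB.
wb_files = ["ntoskrnl.exe", "win32k.sys", "kernel32.dll", "shell32.dll.mui", "locale.nls"]
ms_files = ["ntoskrnl.exe", "win32k.sys", "kernel32.dll"]


def extension_counts(files):
    """Count files per lower-cased extension (text after the last dot)."""
    return Counter(PurePath(name).suffix.lower().lstrip(".") for name in files)


wb_counts = extension_counts(wb_files)
ms_counts = extension_counts(ms_files)

# Counter subtraction keeps only the positive differences.
extra_in_wb = wb_counts - ms_counts
print(dict(extra_in_wb))
```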

## MSRC CVEs with KB data

Source - [MSRC CVEs](https://msrc.microsoft.com/update-guide/vulnerability)

In [None]:
# Reuse the MSRC data already loaded in the import cell.
msrc_df = pd.DataFrame.from_dict(msrc_pandas)
msrc_df

### How many MSRC CVEs have KB data?

In [None]:
def any_kb_in(kbs, *indexes):
    """True if any KB in the space-separated string appears in any of the indexes."""
    return any(kb in index for kb in kbs.split() for index in indexes)

def has_ms_kb(kbs):
    """True if at least one listed KB is in the MS feeds data."""
    return any_kb_in(kbs, ms_kbs_df.index)

def has_wb_kb(kbs):
    """True if at least one listed KB is in the winbindex data."""
    return any_kb_in(kbs, wb_kbs_df.index)

def has_kb(kbs):
    """True if at least one listed KB is in either source."""
    return any_kb_in(kbs, wb_kbs_df.index, ms_kbs_df.index)

def missing_all_kbs(kbs):
    """True if the CVE lists KBs but none of them is in either source."""
    return len(kbs) > 0 and not has_kb(kbs)

msrc_df['has_kb'] = msrc_df['KBs'].apply(has_kb)
msrc_df['has_ms_kb'] = msrc_df['KBs'].apply(has_ms_kb)
msrc_df['has_wb_kb'] = msrc_df['KBs'].apply(has_wb_kb)
msrc_df['missing_all_kbs'] = msrc_df['KBs'].apply(missing_all_kbs)
# CVEs that list no KBs at all
msrc_df['no_kb_info'] = msrc_df['KBs'].apply(lambda x: len(x) == 0)
msrc_df
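
To turn boolean flags like these into a direct answer, the `True` values can be counted per column. The sketch below uses a toy frame with placeholder CVE names; the column names mirror the ones above, but the values are invented.

```python
import pandas as pd

# Toy flags standing in for the boolean columns of msrc_df.
flags = pd.DataFrame(
    {
        "has_kb": [True, True, False, False],
        "missing_all_kbs": [False, False, True, False],
        "no_kb_info": [False, False, False, True],
    },
    index=["CVE-A", "CVE-B", "CVE-C", "CVE-D"],
)

counts = flags.sum()                     # True counts as 1, so this counts CVEs per flag
shares = flags.mean().mul(100).round(1)  # share of all CVEs, in percent

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

In this toy frame every CVE falls into exactly one of the three flags, so the counts add up to the number of rows.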

### How many MSRC CVEs have KB data by Year?

In [None]:
msrc_df['year'] = msrc_df['Initial Release'].apply(lambda x: x.split('-')[0])

# Select groupby columns with a list; a bare tuple of names raises in pandas 2.x.
year_cols = ['index', 'has_ms_kb', 'has_wb_kb', 'no_kb_info', 'missing_all_kbs']
msrc_df.reset_index().groupby(by='year')[year_cols].sum().plot(kind='bar', figsize=(20,5), title='cvedata KB stats by Year')


### How Many Per Month in 2022?

In [None]:
msrc_df['date'] = pd.to_datetime(msrc_df['Initial Release'])
# Select groupby columns with a list; a bare tuple of names raises in pandas 2.x.
date_cols = ['has_ms_kb', 'has_wb_kb', 'no_kb_info', 'missing_all_kbs']
msrc_date_df = msrc_df.groupby(by='date')[date_cols].sum()
msrc_date_df[msrc_date_df.index.year.isin([2022])].plot(kind='bar', figsize=(20,5), title='cvedata KB stats for 2022')
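
The cell above groups by exact release date, so a month with several release dates gets several bars. If one bar per calendar month is wanted, the dates can be converted to monthly periods before grouping. This is a minimal sketch on invented rows; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Invented rows; only the column names match the notebook.
toy = pd.DataFrame(
    {
        "Initial Release": ["2022-01-11", "2022-01-11", "2022-01-25", "2022-02-08"],
        "has_ms_kb": [True, False, True, True],
        "has_wb_kb": [True, True, False, True],
    }
)

# Collapse each release date to its calendar month (a pandas Period).
toy["month"] = pd.to_datetime(toy["Initial Release"]).dt.to_period("M")

monthly = toy.groupby("month")[["has_ms_kb", "has_wb_kb"]].sum()
print(monthly)
```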


## KB Source Improvement

### What type of CVEs are we missing KBs for? (CVEs with KBs listed, but we lack the KB source)

In [None]:

msrc_df[msrc_df['missing_all_kbs'] == True]['Title'].value_counts()[:50]

In [None]:
msrc_df[msrc_df['missing_all_kbs'] == True]['Title'].value_counts()[:50].plot(kind='bar',figsize=(20,5), title='Types of CVEs missing related KBs sources')

### What type of CVEs lack KB information completely?

In [None]:
msrc_df[msrc_df['no_kb_info'] == True]['Tag'].value_counts()[:50]

In [None]:
msrc_df[msrc_df['no_kb_info'] == True]['Tag'].value_counts()[:50].plot(kind='bar',figsize=(20,5), title='Types of CVEs without KB data listed')

In [None]:
msrc_df[msrc_df['no_kb_info'] == True]['Tag'].value_counts()[1:50].plot(kind='bar',figsize=(20,5), title='Types of CVEs without KB data listed (ignoring Chrome)')

## Most often updated Binary

### Create Dataframe from all sources

In [None]:
import itertools
all_kbs_df = pd.concat([wb_kbs_df,ms_kbs_df]).sort_index()
all_kbs_df.index.name = 'kb'
all_kbs_df = all_kbs_df.groupby('kb').aggregate(list)
all_kbs_df['updated'] = all_kbs_df['updated'].apply(lambda x: list(set(itertools.chain.from_iterable(x))))
all_kbs_df['bin count'] = all_kbs_df['updated'].apply(len)
all_kbs_df
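
The merge step can be easier to follow on a toy example. The sketch below concatenates two small frames, unions the file lists of KBs that appear in both, and counts how often each file is updated. The KB names and file names are placeholders.

```python
import itertools
import pandas as pd

# Placeholder frames standing in for wb_kbs_df and ms_kbs_df.
wb_toy = pd.DataFrame({"updated": [["a.dll", "b.sys"], ["c.exe"]]}, index=["KB1", "KB2"])
ms_toy = pd.DataFrame({"updated": [["a.dll", "d.dll"], ["a.dll"]]}, index=["KB1", "KB3"])

combined = pd.concat([wb_toy, ms_toy])
combined.index.name = "kb"

# Collect every list reported for a KB, then flatten and de-duplicate them.
merged = combined.groupby("kb")["updated"].agg(list).apply(
    lambda lists: sorted(set(itertools.chain.from_iterable(lists)))
)

# One row per (KB, file) pair, then count how many KBs touch each file.
freq = merged.explode().value_counts()
print(merged)
print(freq)
```

De-duplicating per KB before counting means a file listed by both sources for the same KB is counted once for that KB.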

In [None]:
freq_bins = all_kbs_df['updated'].explode().value_counts()
freq_bins.head(50)

In [None]:
freq_bins[:100].plot(kind='bar',figsize=(20,5),title='Most Often Updated Binary')

### Graph number of updated files per KB 

In [None]:
all_kbs_df['bin count'].sort_values(ascending=False)[:50].plot(kind='bar',figsize=(20,5))

### Average number per KB

In [None]:

# Sorting does not affect the mean, so take it directly.
all_kbs_df['bin count'].mean()
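
When a few KBs update far more files than the rest, the mean can sit well above what a typical KB looks like, so it may help to look at the median as well. The counts below are invented to make the effect visible; they are not the notebook's data.

```python
import statistics

# Invented per-KB file counts with one very large update.
bin_counts = [3, 5, 8, 12, 4000]

mean_count = statistics.mean(bin_counts)      # pulled up by the single large KB
median_count = statistics.median(bin_counts)  # the middle value, robust to outliers

print(mean_count, median_count)
```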