# Processing network traffic data
The goal is to bring the network traffic data into a form suitable for use in future algorithms or models.

# Sources
 - [source dataset](https://raw.githubusercontent.com/dm-fedorov/pandas_basic/master/%D1%83%D0%BF%D1%80%D0%B0%D0%B6%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F/data/synscan.csv), taken from the Internet
 
# Changes
 - 30.11.2020 data processing, main code
 - 01.12.2020 notebook formatting

In [1]:
import pandas as pd
import numpy as np
import re

### Processing the Info column

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/dm-fedorov/pandas_basic/master/%D1%83%D0%BF%D1%80%D0%B0%D0%B6%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F/data/synscan.csv")
df.head(5)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info
0,1,0.0,172.16.0.8,64.13.134.52,TCP,58,36050 > 443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
1,2,0.001539,172.16.0.8,64.13.134.52,TCP,58,36050 > 143 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
2,3,0.001597,172.16.0.8,64.13.134.52,TCP,58,36050 > 3306 [SYN] Seq=0 Win=2048 Len=0 MSS=...
3,4,0.00165,172.16.0.8,64.13.134.52,TCP,58,36050 > 199 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
4,5,0.001703,172.16.0.8,64.13.134.52,TCP,58,36050 > 111 [SYN] Seq=0 Win=1024 Len=0 MSS=1460


In [3]:
df.tail(5)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info
2006,2007,9.387931,64.13.134.52,172.16.0.8,TCP,60,"[TCP Retransmission] 80 > 36050 [SYN, ACK] S..."
2007,2008,11.06419,64.13.134.52,172.16.0.8,TCP,60,"[TCP Retransmission] 22 > 36050 [SYN, ACK] S..."
2008,2009,21.093215,64.13.134.52,172.16.0.8,TCP,60,"[TCP Retransmission] 53 > 36050 [SYN, ACK] S..."
2009,2010,21.40118,64.13.134.52,172.16.0.8,TCP,60,"[TCP Retransmission] 80 > 36050 [SYN, ACK] S..."
2010,2011,23.085343,64.13.134.52,172.16.0.8,TCP,60,"[TCP Retransmission] 22 > 36050 [SYN, ACK] S..."


In [4]:
df["tcp_retransmission"] = df.Info.str.find("[TCP Retransmission]")  
df["tcp_retransmission"]

0      -1
1      -1
2      -1
3      -1
4      -1
       ..
2006    0
2007    0
2008    0
2009    0
2010    0
Name: tcp_retransmission, Length: 2011, dtype: int64

In [5]:
df["tcp_retransmission"] = df["tcp_retransmission"] == 0  # True where the marker starts at position 0
df["tcp_retransmission"]

0       False
1       False
2       False
3       False
4       False
        ...  
2006     True
2007     True
2008     True
2009     True
2010     True
Name: tcp_retransmission, Length: 2011, dtype: bool
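
The two steps above (find the marker position, then compare it with 0) can probably be collapsed into one call. The following is a minimal sketch on a hypothetical two-row sample rather than the real dataset; the second string is cut at the point where the notebook output is truncated.

```python
import pandas as pd

# Hypothetical sample in the shape of the Info column
info = pd.Series([
    "36050  >  443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460",
    "[TCP Retransmission] 80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840",
])

# True where the string begins with the marker, False otherwise
is_retransmission = info.str.startswith("[TCP Retransmission]")
print(is_retransmission.tolist())  # [False, True]
```

`str.startswith` treats its argument as a literal string, so the square brackets need no escaping.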

In [6]:
df.head(100)

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info,tcp_retransmission
0,1,0.000000,172.16.0.8,64.13.134.52,TCP,58,36050 > 443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460,False
1,2,0.001539,172.16.0.8,64.13.134.52,TCP,58,36050 > 143 [SYN] Seq=0 Win=3072 Len=0 MSS=1460,False
2,3,0.001597,172.16.0.8,64.13.134.52,TCP,58,36050 > 3306 [SYN] Seq=0 Win=2048 Len=0 MSS=...,False
3,4,0.001650,172.16.0.8,64.13.134.52,TCP,58,36050 > 199 [SYN] Seq=0 Win=3072 Len=0 MSS=1460,False
4,5,0.001703,172.16.0.8,64.13.134.52,TCP,58,36050 > 111 [SYN] Seq=0 Win=1024 Len=0 MSS=1460,False
...,...,...,...,...,...,...,...,...
95,96,1.690614,172.16.0.8,64.13.134.52,TCP,58,36050 > 7777 [SYN] Seq=0 Win=2048 Len=0 MSS=...,False
96,97,1.690678,172.16.0.8,64.13.134.52,TCP,58,36050 > 4848 [SYN] Seq=0 Win=3072 Len=0 MSS=...,False
97,98,1.690745,172.16.0.8,64.13.134.52,TCP,58,36050 > 32778 [SYN] Seq=0 Win=1024 Len=0 MSS...,False
98,99,1.690808,172.16.0.8,64.13.134.52,TCP,58,36050 > 16080 [SYN] Seq=0 Win=1024 Len=0 MSS...,False


In [7]:
df["Info"] = df["Info"].str.replace("[TCP Retransmission]", "", regex=False)  # literal match, regular expressions disabled
df["Info"]

0       36050  >  443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
1       36050  >  143 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
2       36050  >  3306 [SYN] Seq=0 Win=2048 Len=0 MSS=...
3       36050  >  199 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
4       36050  >  111 [SYN] Seq=0 Win=1024 Len=0 MSS=1460
                              ...                        
2006     80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 ...
2007     22  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 ...
2008     53  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 ...
2009     80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 ...
2010     22  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 ...
Name: Info, Length: 2011, dtype: object

In [8]:
df["Info"] = df["Info"].str.strip()
df["Info"]

0       36050  >  443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
1       36050  >  143 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
2       36050  >  3306 [SYN] Seq=0 Win=2048 Len=0 MSS=...
3       36050  >  199 [SYN] Seq=0 Win=3072 Len=0 MSS=1460
4       36050  >  111 [SYN] Seq=0 Win=1024 Len=0 MSS=1460
                              ...                        
2006    80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 L...
2007    22  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 L...
2008    53  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 L...
2009    80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 L...
2010    22  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840 L...
Name: Info, Length: 2011, dtype: object

In [10]:
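source_port = []
destination_port = []
for info in df["Info"]:
    # Split "36050  >  443 [SYN] ..." on spaces and drop empty tokens and the ">" separator
    tokens = [t for t in info.split(' ') if t not in ('', '>')]
    source_port.append(int(tokens[0]))       # first token: source port
    destination_port.append(int(tokens[1]))  # second token: destination port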

In [11]:
flags = []
for info in df["Info"]:
    # Text between the first "[" and the first "]", e.g. "SYN, ACK"
    flag_text = info[info.find('[') + 1 : info.find(']')]
    flags.append(flag_text.split(", "))
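
The port and flag loops above could also be written without explicit iteration. This sketch uses `Series.str.extract` with named groups on a hypothetical two-row sample; the pattern is an assumption about the layout of the Info strings (`<port> > <port> [<flags>]`), not something taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical sample in the shape of the cleaned Info column
info = pd.Series([
    "36050  >  443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460",
    "80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840",
])

# Named groups become column names in the frame returned by str.extract
pattern = r"^(?P<Source_port>\d+)\s*>\s*(?P<Destination_port>\d+)\s*\[(?P<Flags>[^\]]*)\]"
parsed = info.str.extract(pattern)

# Extracted values are strings; convert the ports to integers
parsed["Source_port"] = parsed["Source_port"].astype(int)
parsed["Destination_port"] = parsed["Destination_port"].astype(int)
# Turn "SYN, ACK" into ["SYN", "ACK"]
parsed["Flags"] = parsed["Flags"].str.split(", ")
print(parsed)
```

Rows that do not match the pattern would come back as missing values, so the integer conversion should only be applied once every row is known to match.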

In [12]:
def extract_int(info, name):
    # Return the integer after "<name>=" in the Info string, or None when the field is absent
    match = re.search(rf'{name}=(\d+)', info)
    return int(match[1]) if match else None

# None (rather than the string 'None') marks a missing field, so pandas can treat it as a missing value
seq = [extract_int(s, 'Seq') for s in df.Info]
win = [extract_int(s, 'Win') for s in df.Info]
ack = [extract_int(s, 'Ack') for s in df.Info]
Len = [extract_int(s, 'Len') for s in df.Info]
MSS = [extract_int(s, 'MSS') for s in df.Info]
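
The same `name=value` fields can likely be pulled out column by column with pandas string methods. This is a sketch on a hypothetical two-row sample; the second row stops where the notebook output is truncated, so it has no `Len` or `MSS`.

```python
import pandas as pd

# Hypothetical sample in the shape of the cleaned Info column
info = pd.Series([
    "36050  >  443 [SYN] Seq=0 Win=3072 Len=0 MSS=1460",
    "80  >  36050 [SYN, ACK] Seq=0 Ack=1 Win=5840",
])

# One column per field; rows where the field is absent become missing values
fields = pd.DataFrame({
    name: pd.to_numeric(info.str.extract(rf"{name}=(\d+)", expand=False))
    for name in ["Seq", "Win", "Ack", "Len", "MSS"]
})
print(fields)
```

Missing fields end up as NaN instead of a placeholder string, which keeps the columns usable for numeric filtering and aggregation.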

In [13]:
df["Source_port"] = source_port
df["Destination_port"] = destination_port
df["Flags"] = flags
df["Seq"] = seq
df["Win"] = win
df["Ack"] = ack
df["Len"] = Len
df["MSS"] = MSS

In [14]:
df = df.drop(columns="Info")  # drop the parsed column by name rather than by position

In [15]:
df.head()

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,tcp_retransmission,Source_port,Destination_port,Flags,Seq,Win,Ack,Len,MSS
0,1,0.0,172.16.0.8,64.13.134.52,TCP,58,False,36050,443,[SYN],0,3072,,0,1460
1,2,0.001539,172.16.0.8,64.13.134.52,TCP,58,False,36050,143,[SYN],0,3072,,0,1460
2,3,0.001597,172.16.0.8,64.13.134.52,TCP,58,False,36050,3306,[SYN],0,2048,,0,1460
3,4,0.00165,172.16.0.8,64.13.134.52,TCP,58,False,36050,199,[SYN],0,3072,,0,1460
4,5,0.001703,172.16.0.8,64.13.134.52,TCP,58,False,36050,111,[SYN],0,1024,,0,1460


### Working with time series

In [51]:
# Time holds offsets from the first packet in seconds (assumed: Wireshark's default time format), so the unit is 's', not 'D'
df["Time"] = pd.to_timedelta(df["Time"], unit='s')

In [52]:
df.head()

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,tcp_retransmission,Source_port,Destination_port,Flags,Seq,Win,Ack,Len,MSS
0,1,00:00:00,172.16.0.8,64.13.134.52,TCP,58,False,36050,443,[SYN],0,3072,,0,1460
1,2,00:00:00.001539,172.16.0.8,64.13.134.52,TCP,58,False,36050,143,[SYN],0,3072,,0,1460
2,3,00:00:00.001597,172.16.0.8,64.13.134.52,TCP,58,False,36050,3306,[SYN],0,2048,,0,1460
3,4,00:00:00.001650,172.16.0.8,64.13.134.52,TCP,58,False,36050,199,[SYN],0,3072,,0,1460
4,5,00:00:00.001703,172.16.0.8,64.13.134.52,TCP,58,False,36050,111,[SYN],0,1024,,0,1460
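
The unit passed to `pd.to_timedelta` decides how the raw numbers are interpreted. The sketch below, which assumes the capture's Time values are seconds since the first packet (an assumption about this dataset, based on Wireshark's default time display), compares the same offsets read as seconds and as days.

```python
import pandas as pd

# Offsets taken from the Time column shown earlier in the notebook
offsets = pd.Series([0.0, 0.001539, 9.387931])

as_seconds = pd.to_timedelta(offsets, unit="s")  # 0.001539 -> about 1.5 milliseconds
as_days = pd.to_timedelta(offsets, unit="D")     # 0.001539 -> a bit over 2 minutes

# Compare both interpretations side by side, expressed in seconds
comparison = pd.DataFrame({
    "raw": offsets,
    "read_as_seconds": as_seconds.dt.total_seconds(),
    "read_as_days": as_days.dt.total_seconds(),
})
print(comparison)
```

Reading the values as days stretches the capture by a factor of 86400, which is worth checking before any time-based resampling.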


### Saving the processed data

In [55]:
df.to_csv('processed_synscan.csv')
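
CSV stores everything as text, so the list-valued `Flags` column and the timedelta `Time` column need to be converted back when the file is read again. The following is a sketch of one possible round trip on a hypothetical two-row frame, written to an in-memory buffer instead of `processed_synscan.csv`; `index=False` is a choice made for this sketch, not what the cell above does.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with the two column types that do not survive CSV as-is
df_out = pd.DataFrame({
    "Time": pd.to_timedelta([0.0, 0.001539], unit="s"),
    "Flags": [["SYN"], ["SYN", "ACK"]],
})

# Write to an in-memory buffer; index=False avoids an extra unnamed column
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)

# Flags is stored as text like "['SYN', 'ACK']"; literal_eval turns it back into a list
restored = pd.read_csv(buffer, converters={"Flags": ast.literal_eval})
# Time is stored as text like "0 days 00:00:00.001539"; parse it back into timedeltas
restored["Time"] = pd.to_timedelta(restored["Time"])
print(restored.dtypes)
```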