# Filtering DSL & Non-DSL related Tickets

The main goal of this notebook is to filter DSL-related issue tickets from the tickets database for the period from 1 January 2020 to 30 April 2021.



>Output data format expected:
* We expect a dataframe with 3 columns, **|assetid|incident_date|label|**, that gives, for each assetid, the DSL ticket that reached the maximum solution level (the most expensive solution).
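
As a rough illustration of that target shape, the sketch below uses plain Python rather than Spark to keep, for each asset, the ticket with the highest solution level. The sample records and the helper `max_level_per_asset` are hypothetical, and using the solution level as the `label` value is an assumption, not something the notebook states.

```python
def max_level_per_asset(tickets):
    """Keep one (assetid, incident_date, label) row per asset: the ticket
    with the highest solution level (ties broken by the latest date)."""
    best = {}
    for assetid, incident_date, level in tickets:
        current = best.get(assetid)
        # Compare (level, date) so a later ticket wins when levels tie
        if current is None or (level, incident_date) > (current[2], current[1]):
            best[assetid] = (assetid, incident_date, level)
    return sorted(best.values())


# Hypothetical sample: two assets with several tickets each
sample = [
    (101, "2020-03-10", 1),
    (101, "2020-03-12", 3),
    (101, "2020-04-01", 2),
    (202, "2020-05-05", 0),
    (202, "2020-06-01", 4),
]
result = max_level_per_asset(sample)
print(result)
```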



>Steps involved:
* Get the tickets data
* Get the DSL data for the same time span and keep only the tickets that have DSL data in the database
* Filter out all replaced and refurbished CPEs with issues and keep the "no problem" ones


## Table Of Contents:
* [1. Import packages and config](#sec1)
    * [1.1 Spark configuration](#sec2)
    * [1.2 Import Packages](#sec3)
    * [1.3 Setup style](#sec4)
* [2. Load tickets data](#sec5)
* [3. Load pol agg data](#sec6)
* [4. Load different data](#sec7)
    * [4.1 Load Replacement data](#sec8)
    * [4.2 Load Refurbishment data](#sec9)
    * [4.3 Join Refurbishment & Replacement data](#sec10)
    * [4.4 Filter CPEs with "No Problem"](#sec11)
    * [4.5 Filter CPES with all issues from the data](#sec12)
* [5. Filter DSL tickets data](#sec13)
    * [5.1 Filter DSL tickets data with all levels](#sec14)
* [6. Filter Non-DSL tickets data](#sec15)

## 1. Import packages and config <a class="anchor" id="sec1"></a>

###  1.1 Spark configuration <a class="anchor" id="sec2"></a>
Let's configure our Spark session.

In [1]:
%%configure -f
{"conf":
 {"spark.driver.cores": "6",
  "spark.driver.memory": "14g",
  "spark.executor.cores": "6",
  "spark.executor.memory": "14g",
  "spark.dynamicAllocation.enabled": "true",
  "spark.dynamicAllocation.minExecutors" : "4",
  "spark.driver.maxResultSize": "4g"
    }
}



###  1.2 Import packages <a class="anchor" id="sec3"></a>

In [2]:
# Data Science Packages
import sys
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from  matplotlib import pyplot
import seaborn as sns
import warnings
from datetime import datetime, timedelta
# Note: a bare `import datetime` here would shadow the class imported above
# and break datetime.strptime() in parsing_date_format.


#Spark Packages
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql import types 
import pyspark.sql.types
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.functions import col


SparkSession available as 'spark'.



###  1.3  Let's set up some style!<a class="anchor" id="sec4"></a>
This is not a requirement, but you know what they say:
> "Fashions fade, style is eternal." <br/>
_Yves Saint Laurent_

In [3]:
# Matplotlib

#matplotlib.use('agg')
plt.switch_backend('agg')

# Seaborn Style
sns.set(style='ticks')
sns.set_style({'font.family': 'Hiragino Maru Gothic Pro'})
sns.set_palette("cool")

# Pandas Style
pd.set_option("display.max_columns", 9999)
pd.set_option("display.max_rows", 9999)

# Ignore annoying warning 
warnings.filterwarnings('ignore')


##  2. Loading Tickets Data  <a class="anchor" id="sec5"></a>

Let's load the tickets data, keeping only the fields we need.

In [4]:
tickets_dir = "hdfs://nameservicedev1//user/dt_srajan/predictive_care/tickets/"
ticket_cols = ['TICKET_ID','START_DATE_TIME_DONAT','ASSET_ID','Uzrok_smetnje','WORK_CODE','Vrsta_prijavljene_smetnje','SOURCE','SOLUTION_LEVEL','SUMMARY']

# (file name, timestamp pattern of START_DATE_TIME_DONAT in that file)
ticket_files = [
    ("Smetnje_0107-1510_2020.csv", "yyyy-MM-dd'T'HH:mm:ss"),
    ("FAULTS_E2E_01102020-31122020.csv", "yyyy-MM-dd HH:mm:ss"),
    ("FAULTS_01012020-30062020.csv", "yyyy-MM-dd'T'HH:mm:ss"),
    ("FAULTS_01012021-30042021.csv", "yyyy-MM-dd'T'HH:mm:ss"),
]

def load_tickets(file_name, ts_pattern):
    """Read one tickets export and normalise its start timestamp."""
    df = spark.read.option("encoding", "ISO-8859-1").option("inferSchema", "true") \
        .csv(tickets_dir + file_name, header=True, sep=";") \
        .select(*ticket_cols)
    df = df.withColumn('START_DATE_TICKET', F.unix_timestamp('START_DATE_TIME_DONAT', ts_pattern).cast(types.TimestampType()))
    # "yyyy" (not "yyy") is the four-digit year pattern
    return df.withColumn('START_DATE_TICKET', F.date_format(F.col('START_DATE_TICKET'), "yyyy-MM-dd HH:mm:ss"))

# Stack the four exports on top of each other
df_tickets = None
for file_name, ts_pattern in ticket_files:
    df_part = load_tickets(file_name, ts_pattern)
    df_tickets = df_part if df_tickets is None else df_tickets.union(df_part)

# Translate the Croatian column names, drop the raw timestamp and remove duplicate rows
df_tickets = df_tickets.withColumnRenamed('Uzrok_smetnje', 'TICKET_TENTATIVE_ROOT_CAUSE') \
    .withColumnRenamed('Vrsta_prijavljene_smetnje', 'Type_of_reported_interference') \
    .drop('START_DATE_TIME_DONAT') \
    .dropDuplicates()
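
The four exports do not share one timestamp layout: three use an ISO-style `T` separator and one uses a space, which is why the cell above parses them with different patterns. The sketch below shows the same normalisation in plain Python. `normalise_ticket_timestamp` is a hypothetical helper and the sample strings are made up.

```python
from datetime import datetime

# The two layouts seen in the ticket exports (strptime equivalents of the Spark patterns)
KNOWN_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S")


def normalise_ticket_timestamp(raw):
    """Return the timestamp as 'YYYY-MM-DD HH:MM:SS', or None if it cannot be parsed."""
    if raw is None:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    return None


print(normalise_ticket_timestamp("2020-03-10T11:32:47"))  # 2020-03-10 11:32:47
print(normalise_ticket_timestamp("2020-10-04 12:45:53"))  # 2020-10-04 12:45:53
print(normalise_ticket_timestamp("not a date"))           # None
```

Returning `None` for unparseable input is a design choice of this sketch; it leaves such rows visible as missing values instead of failing the whole load.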


In [5]:
df_tickets.describe().select('START_DATE_TICKET').show()


+-------------------+
|  START_DATE_TICKET|
+-------------------+
|            1218714|
|               null|
|               null|
|2020-01-01 00:13:34|
|2021-04-30 23:57:23|
+-------------------+

In [6]:
print('Number of unique ASSET_IDs in the tickets database:',df_tickets.select('ASSET_ID').distinct().count())

('Number of unique ASSET_IDs in the tickets database:', 658190)

##  3. Loading pol day aggregation data<a class="anchor" id="sec6"></a>

Load the pol day aggregation data for the same time frame as the tickets dataset.

In [5]:
def rename_cols(df, prefix):
    """Rename columns by adding a prefix.
    
    Args:
        df (dataframe): Input dataframe.
        prefix (str): Prefix to prepend to every column name.
        
    Returns:
        df (dataframe): Output dataframe with the prefix prepended to its columns.
    """
    for feature in df.columns:
        df = df.withColumnRenamed(feature,prefix+feature)
    return df

def parsing_date_format(datetime_str, target_date_format):
    """Function to modify the date format.
    
    Args:
        datetime_str (str): Date in the original format; expected "%Y-%m-%d".
        target_date_format (str): Desired date format, e.g. "%Y%m%d".
    
    Returns:
        str: Date in the target format.
    """
    date_obj = datetime.strptime(datetime_str, "%Y-%m-%d").date()
    return date_obj.strftime(target_date_format)

def data_import(database, table, start_time, end_time, spark):
    """Function to load data from Hive tables based on start time and end time.
    
    Args:
        database (string): Database name in HIVE.
        table (string): Table name in HIVE.
        start_time (str): Start time for the query.
        end_time (str): End time for the query.
        spark (obj): Spark session object.
        
    Returns:
        dataframe :  spark dataframe with loaded data.
    """
    #start_time = parsing_date_format(start_time, "%Y-%m-%d")
    #end_time = parsing_date_format(end_time, "%Y-%m-%d")

    df = spark.sql("select * from {0}.{1} where {2} between '{3}' and '{4}' " \
              .format(database, table, 'datum', start_time, end_time))
    return df

def spark_init():
    """Init Spark session.
    Returns:
        object: spark session.
    """
    spark = SparkSession.builder \
        .master('yarn') \
        .appName('predictive_care') \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    return spark

spark=spark_init()
start_time, end_time = '2020-01-01', '2021-04-30'
df_dslam = data_import('cdl_blos', 'pol_day_aggregation', start_time, end_time, spark) #per day

    
# Drop rows with missing keys
df_dslam = df_dslam.na.drop(subset=['assetid','datum'])

# Keep a single sample per asset and day
df_dslam = df_dslam.dropDuplicates(['assetid','datum'])

# Prefix the columns so they stay distinguishable after the join
df_dslam = rename_cols(df_dslam, 'dslam_')
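
`parsing_date_format` relies on `datetime` being the class imported with `from datetime import datetime`, and `data_import` only interpolates its arguments into a SQL string. The standalone sketch below exercises both pieces without a Spark session. `build_query` is a hypothetical helper that mirrors the string built inside `data_import`.

```python
from datetime import datetime


def parsing_date_format(datetime_str, target_date_format):
    """Convert a 'YYYY-MM-DD' string to another strftime layout."""
    date_obj = datetime.strptime(datetime_str, "%Y-%m-%d").date()
    return date_obj.strftime(target_date_format)


def build_query(database, table, start_time, end_time, date_col="datum"):
    """Hypothetical helper: the same SELECT that data_import passes to spark.sql."""
    return "select * from {0}.{1} where {2} between '{3}' and '{4}'".format(
        database, table, date_col, start_time, end_time)


compact = parsing_date_format("2020-01-01", "%Y%m%d")
query = build_query("cdl_blos", "pol_day_aggregation", "2020-01-01", "2021-04-30")
print(compact)  # 20200101
print(query)
```

Because SQL `between` is inclusive, both boundary dates are part of the result.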


In [10]:
df_dslam.columns


['dslam_date_inserted', 'dslam_dns_device', 'dslam_slot', 'dslam_port', 'dslam_ip_device', 'dslam_dis_device', 'dslam_distinct_modulation', 'dslam_no_of_dominant_modulation', 'dslam_code_of_dominant_modulation', 'dslam_avg_bitrate_us', 'dslam_max_bitrate_us', 'dslam_min_bitrate_us', 'dslam_avg_bitrate_ds', 'dslam_max_bitrate_ds', 'dslam_min_bitrate_ds', 'dslam_avg_attenuation_ds', 'dslam_max_attenuation_ds', 'dslam_min_attenuation_ds', 'dslam_avg_attenuation_us', 'dslam_max_attenuation_us', 'dslam_min_attenuation_us', 'dslam_avg_power_us', 'dslam_max_power_us', 'dslam_min_power_us', 'dslam_avg_power_ds', 'dslam_max_power_ds', 'dslam_min_power_ds', 'dslam_avg_att_bitrate_us', 'dslam_max_att_bitrate_us', 'dslam_min_att_bitrate_us', 'dslam_avg_att_bitrate_ds', 'dslam_max_att_bitrate_ds', 'dslam_min_att_bitrate_ds', 'dslam_avg_snr_us', 'dslam_max_snr_us', 'dslam_min_snr_us', 'dslam_avg_snr_ds', 'dslam_max_snr_ds', 'dslam_min_snr_ds', 'dslam_no_counts', 'dslam_avg_bandline_ds', 'dslam_max_b

In [6]:
df_ticket_filtered_dsl = df_tickets.join(df_dslam,df_tickets.ASSET_ID == df_dslam.dslam_assetid, how = 'inner')


In [12]:
print('Number of assetid which are related to DSL issues:',df_ticket_filtered_dsl.select('ASSET_ID').distinct().count())


('Number of assetid which are related to DSL issues:', 320253)

In [7]:
# Drop every DSLAM column; they were only needed to restrict the tickets to assets with DSL data
dslam_cols = [c for c in df_ticket_filtered_dsl.columns if c.startswith('dslam_')]
df_ticket_filtered_dsl = df_ticket_filtered_dsl.drop(*dslam_cols)
df_ticket_filtered_dsl.limit(3).show(200,truncate=False, vertical=True)


-RECORD 0----------------------------------------------------
 TICKET_ID                     | 44888997                    
 ASSET_ID                      | 2222714                     
 TICKET_TENTATIVE_ROOT_CAUSE   | Dotrajalost - 1             
 WORK_CODE                     | FAULTREPAIR                 
 Type_of_reported_interference | prekid                      
 SOURCE                        | WWMS                        
 SOLUTION_LEVEL                | 3                           
 SUMMARY                       | DSL SINHRONIZACIJA - PREKID 
 START_DATE_TICKET             | 2020-03-10 11:32:47         
-RECORD 1----------------------------------------------------
 TICKET_ID                     | 44888997                    
 ASSET_ID                      | 2222714                     
 TICKET_TENTATIVE_ROOT_CAUSE   | Dotrajalost - 1             
 WORK_CODE                     | FAULTREPAIR                 
 Type_of_reported_interference | prekid                      
 SOURCE 

In [13]:
df_ticket_filtered_dsl.groupBy("SOLUTION_LEVEL") \
    .agg(F.countDistinct("ASSET_ID").alias("count_assetid")).show()


+--------------+-------------+
|SOLUTION_LEVEL|count_assetid|
+--------------+-------------+
|          null|           14|
|             1|       130131|
|             3|       219888|
|             4|        34736|
|             2|        79467|
|             0|         2277|
+--------------+-------------+

## 4. Loading different data<a class="anchor" id="sec7"></a>

### 4.1 Loading Replacement data<a class="anchor" id="sec8"></a>

In [8]:
path_cpe_replacement1 = "hdfs://nameservicedev1//user/dt_srajan/predictive_care/replacement/cpe_replacement.csv"
df_cpe_replacement1 = spark.read.option("encoding", "ISO-8859-1").option("inferSchema", "true").csv(path_cpe_replacement1,header = True,sep=";")

path_cpe_replacement2 = "hdfs://nameservicedev1//user/dt_srajan/predictive_care/replacement/replacement_new_data/CPE replacement_2020_01-06.csv"
df_cpe_replacement2 = spark.read.option("encoding", "ISO-8859-1").option("inferSchema", "true").csv(path_cpe_replacement2,header = True,sep=";")

path_cpe_replacement3 = "hdfs://nameservicedev1//user/dt_srajan/predictive_care/replacement/replacement_new_data/CPE replacement_2021_01-07.csv"
df_cpe_replacement3 = spark.read.option("encoding", "ISO-8859-1").option("inferSchema", "true").csv(path_cpe_replacement3,header = True,sep=";")
# Keep January to April 2021 only, to match the tickets time span
df_cpe_replacement3 = df_cpe_replacement3.filter(F.col('Month').isin(1, 2, 3, 4))

df_cpe_replacement1 = df_cpe_replacement1.select('TICKET_ID','ASSET_ID', 'CPE_SERIAL_NUMBER')
df_cpe_replacement2 = df_cpe_replacement2.select('TICKET_ID','ASSET_ID', 'CPE_SERIAL_NUMBER')
df_cpe_replacement3 = df_cpe_replacement3.select('TICKET_ID','ASSET_ID', 'CPE_SERIAL_NUMBER')
df_cpe_replacement = df_cpe_replacement1.union(df_cpe_replacement2).union(df_cpe_replacement3)
df_cpe_replacement = df_cpe_replacement.dropDuplicates()
df_cpe_replacement = df_cpe_replacement.na.drop(subset=['TICKET_ID','ASSET_ID','CPE_SERIAL_NUMBER'])


### 4.2 Loading Refurbishment data<a class="anchor" id="sec9"></a>

In [9]:
path_cpe_refurbishment = "hdfs://nameservicedev1//user/dt_srajan/predictive_care/replacement/CPE_refurbishment_2020_2021.csv"
df_cpe_refurbishment = spark.read.option("encoding", "ISO-8859-1").option("inferSchema", "true").csv(path_cpe_refurbishment,header = True,sep=",")

from pyspark.sql.types import StringType, TimestampType

# Parse the scan dates (dd/MM/yyyy) and store them as "yyyy-MM-dd HH:mm:ss" strings
df_cpe_refurbishment = df_cpe_refurbishment \
    .withColumn('DATE_SCAN_END', F.unix_timestamp('Datum_izlaznog_skeniranja', "dd/MM/yyyy").cast(TimestampType())) \
    .withColumn('DATE_SCAN_START', F.unix_timestamp('Datum_ulaznog_skeniranja', "dd/MM/yyyy").cast(TimestampType()))

df_cpe_refurbishment = df_cpe_refurbishment \
    .withColumn('DATE_SCAN_END', F.date_format(F.col("DATE_SCAN_END"), "yyyy-MM-dd HH:mm:ss")) \
    .withColumn('DATE_SCAN_START', F.date_format(F.col("DATE_SCAN_START"), "yyyy-MM-dd HH:mm:ss"))

df_cpe_refurbishment = df_cpe_refurbishment.withColumnRenamed('Vrsta kvara', 'Vrsta_kvara')

df_cpe_refurbishment.select('DATE_SCAN_START',  "DATE_SCAN_END").describe().show()

df_cpe_refurbishment = df_cpe_refurbishment.filter(df_cpe_refurbishment.Vrsta_kvara != 'NULL')
df_cpe_refurbishment = df_cpe_refurbishment.withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Napajanje', 'power_supply')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'ZnaÄajno fiziÄko oÅ¡teÄenje - nepopravljivo', 'significant_physical_damage_irreparable')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Ne prijavljuje se u ACS', 'does_not_log_into_acs')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Reset ne radi', 'reset_not_work')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Manje fiziÄko ili toplinsko oÅ¡teÄenje', 'minor_physical_thermal_damage')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Ispravan', 'no_problem')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'OptiÄki port', 'optical_port')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'SIM utor', 'sim_slot')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'WIFI', 'wifi')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Blokiran', 'blocked')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Resetira se sam', 'reset_itself')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Internet ne radi', 'internet_not_working')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'DSL port', 'dsl_port')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Pregrijavanje ureÄaja', 'device_overheating')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'LAN port', 'lan_port')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Slaba brzina', 'poor_speed')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Telefon port', 'telefon_port')) \
     .withColumn('Vrsta_kvara', F.regexp_replace('Vrsta_kvara', 'Software', 'software'))

df_cpe_refurbishment = df_cpe_refurbishment.withColumnRenamed('Vrsta_kvara', 'CPE_ISSUE')\
                                            .withColumnRenamed('Serijski_broj', 'CPE_SERIAL_NUMBER')
# Remove the space after "/" in combined issue labels (built-in function instead of a Python UDF)
df_cpe_refurbishment = df_cpe_refurbishment.withColumn("CPE_ISSUE", F.regexp_replace("CPE_ISSUE", "/ ", "/"))

df_cpe_refurbishment = df_cpe_refurbishment.select('CPE_SERIAL_NUMBER', 'DATE_SCAN_START','DATE_SCAN_END','CPE_ISSUE')
df_cpe_refurbishment.limit(3).show(200,truncate=False, vertical=True)
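
The chain of `regexp_replace` calls is effectively a lookup table applied to labels that may combine several faults separated by `/`. A plain-Python sketch of that idea follows. The `TRANSLATIONS` dictionary lists only a few of the mappings above, and `translate_issue` and the sample input strings are hypothetical.

```python
# A few of the Croatian -> English mappings used in the cell above
TRANSLATIONS = {
    "Napajanje": "power_supply",
    "DSL port": "dsl_port",
    "LAN port": "lan_port",
    "Ispravan": "no_problem",
    "WIFI": "wifi",
}


def translate_issue(raw_label):
    """Translate each '/'-separated fault and rejoin the parts without spaces."""
    parts = [part.strip() for part in raw_label.split("/")]
    # Unknown labels are passed through unchanged
    return "/".join(TRANSLATIONS.get(part, part) for part in parts)


print(translate_issue("Napajanje/ DSL port"))  # power_supply/dsl_port
print(translate_issue("Ispravan"))             # no_problem
```

Unlike `regexp_replace`, which substitutes a pattern wherever it occurs in the value, this version matches each fault name exactly, so a label can never be rewritten by accident through a partial match.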


+-------+-------------------+-------------------+
|summary|    DATE_SCAN_START|      DATE_SCAN_END|
+-------+-------------------+-------------------+
|  count|              90920|              90920|
|   mean|               null|               null|
| stddev|               null|               null|
|    min|2020-01-02 00:00:00|2020-01-14 00:00:00|
|    max|2021-04-23 00:00:00|2021-04-30 00:00:00|
+-------+-------------------+-------------------+

-RECORD 0------------------------------------------
 CPE_SERIAL_NUMBER | j835bh004425                  
 DATE_SCAN_START   | 2020-01-14 00:00:00           
 DATE_SCAN_END     | 2020-01-14 00:00:00           
 CPE_ISSUE         | minor_physical_thermal_damage 
-RECORD 1------------------------------------------
 CPE_SERIAL_NUMBER | J833BH004104                  
 DATE_SCAN_START   | 2020-01-14 00:00:00           
 DATE_SCAN_END     | 2020-01-14 00:00:00           
 CPE_ISSUE         | minor_physical_thermal_damage 
-RECORD 2--------------------

### 4.3 Joining replacement and refurbishment data <a class="anchor" id="sec10"></a>

In [10]:
df = df_cpe_replacement.join(df_cpe_refurbishment, on = 'CPE_SERIAL_NUMBER', how = 'inner' )
df.groupBy("CPE_ISSUE") \
    .agg(F.countDistinct("ASSET_ID").alias("count_assetid")).show()


+--------------------+-------------+
|           CPE_ISSUE|count_assetid|
+--------------------+-------------+
|power_supply/lan_...|          128|
|          no_problem|        16605|
|minor_physical_th...|         1429|
|power_supply/dsl_...|          385|
|reset_itself/rese...|            1|
|telefon_port/powe...|           17|
|telefon_port/inte...|            2|
|                wifi|          187|
|does_not_log_into...|          206|
|reset_itself/dsl_...|            5|
|telefon_port/dsl_...|           16|
|telefon_port/lan_...|           15|
|internet_not_work...|            1|
|            software|            2|
|minor_physical_th...|            1|
|        power_supply|         4134|
|telefon_port/lan_...|            5|
|reset_itself/powe...|            1|
|            dsl_port|         4472|
|            lan_port|          301|
+--------------------+-------------+
only showing top 20 rows

### 4.4  Filtering out CPEs with the no_problem issue type <a class="anchor" id="sec11"></a>

In [11]:
df = df.filter(df.CPE_ISSUE != 'no_problem')
df.groupBy("CPE_ISSUE") \
    .agg(F.countDistinct("ASSET_ID").alias("count_assetid")).show()


+--------------------+-------------+
|           CPE_ISSUE|count_assetid|
+--------------------+-------------+
|power_supply/lan_...|          128|
|minor_physical_th...|         1429|
|power_supply/dsl_...|          385|
|reset_itself/rese...|            1|
|telefon_port/powe...|           17|
|telefon_port/inte...|            2|
|                wifi|          187|
|does_not_log_into...|          206|
|reset_itself/dsl_...|            5|
|telefon_port/dsl_...|           16|
|telefon_port/lan_...|           15|
|internet_not_work...|            1|
|            software|            2|
|minor_physical_th...|            1|
|        power_supply|         4134|
|telefon_port/lan_...|            5|
|reset_itself/powe...|            1|
|            dsl_port|         4472|
|            lan_port|          301|
|  device_overheating|          114|
+--------------------+-------------+
only showing top 20 rows

### 4.5  Filtering out CPEs with actual issues  <a class="anchor" id="sec12"></a>

We remove from the DSL ticket data every asset whose replaced CPE turned out to have an actual issue (anything other than "no_problem").

In [12]:
df_ticket_filtered_dsl = df_ticket_filtered_dsl.join(df,on = 'ASSET_ID', how = 'left_anti')
df_ticket_filtered_dsl.select('ASSET_ID').distinct().count()


307036
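
A `left_anti` join keeps only the left-hand rows that have no match on the right. The sketch below mimics this with plain Python sets. The asset IDs are made up and `left_anti` is a hypothetical stand-in for the Spark join.

```python
def left_anti(left_rows, right_keys, key):
    """Return the rows of left_rows whose key value is absent from right_keys."""
    excluded = set(right_keys)
    return [row for row in left_rows if row[key] not in excluded]


# Hypothetical tickets and the assets whose replaced CPE had a confirmed fault
tickets = [
    {"ASSET_ID": 1, "TICKET_ID": 10},
    {"ASSET_ID": 2, "TICKET_ID": 11},
    {"ASSET_ID": 2, "TICKET_ID": 12},
    {"ASSET_ID": 3, "TICKET_ID": 13},
]
faulty_cpe_assets = [2]

kept = left_anti(tickets, faulty_cpe_assets, "ASSET_ID")
print(kept)
```

In the notebook this step reduces the number of distinct assets from 320253 to 307036.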

## 5. Filtering tickets data for DSL lines with all levels and excluding thunderstorm cases  <a class="anchor" id="sec13"></a>

In [15]:
df_ticket_filtered_dsl_a = df_ticket_filtered_dsl.select('SOURCE','SOLUTION_LEVEL').dropDuplicates() 


In [16]:
df_ticket_filtered_dsl_p = df_ticket_filtered_dsl_a.toPandas()


In [17]:
df_ticket_filtered_dsl_p.groupby('SOURCE')['SOLUTION_LEVEL'].unique()


SOURCE
0915763509"                                                                  [nan]
098705584"                                                                   [nan]
?\t?Korisnik potvrdio da nema opasnosti za zdravlje tehnièara?"              [nan]
BBSA                                                                         [0.0]
DONAT                                                                        [1.0]
Ivana"                                                                       [nan]
KB 385922509978"                                                             [nan]
KB : 385955812189"                                                           [nan]
KB: 0959029975."                                                             [nan]
KB: 385917909536"                                                            [nan]
Korisnik potvrdio da nema opasnosti za zdravlje tehnièara - da"              [nan]
LP"                                                                          [na

**Insights:**
> The possible combinations of `SOURCE` and `SOLUTION_LEVEL` are:

> `SOURCE` BBSA: `SOLUTION_LEVEL` 0
>
> `SOURCE` DONAT: `SOLUTION_LEVEL` 1
>
> `SOURCE` WWMS: `SOLUTION_LEVEL` 2, 3 or 4
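
These observations can be turned into a simple consistency check. The sketch below is plain Python, with a hypothetical `EXPECTED_LEVELS` table and `is_consistent` helper. It flags rows whose source and solution level do not fit the pattern above, such as the garbled source values in the previous output.

```python
# Source system -> solution levels observed in the notebook
EXPECTED_LEVELS = {
    "BBSA": {0},
    "DONAT": {1},
    "WWMS": {2, 3, 4},
}


def is_consistent(source, solution_level):
    """True when the (source, level) pair matches the observed pattern."""
    return solution_level in EXPECTED_LEVELS.get(source, set())


rows = [("BBSA", 0), ("DONAT", 1), ("WWMS", 3), ("WWMS", 1), ('Ivana"', None)]
flags = [is_consistent(source, level) for source, level in rows]
print(flags)  # [True, True, True, False, False]
```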

<div class="alert alert-block alert-info">
<b>Check:</b> Let's check whether the same `TICKET_ID` appears multiple times for a particular `ASSET_ID`.
</div>

In [50]:
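# Compare the count column itself; the string "'count'>'1'" compared two literals and was always true
c = df_ticket_filtered_dsl.groupby('ASSET_ID','TICKET_ID').count().filter(F.col('count') > 1)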


In [51]:
d = c.filter(F.col('count')>1)


In [52]:
d.show()


+---------+---------+-----+
| ASSET_ID|TICKET_ID|count|
+---------+---------+-----+
| 32059964| 45310913|  359|
| 32722967| 45469880|  420|
| 36648835|443542297|  420|
| 36736504|419549611|  420|
| 36866543| 45939449|  345|
| 45092363| 44552188|  316|
| 46937250| 44713157|  404|
| 49147435| 45753593|  241|
| 49461984| 45141151|  111|
| 54736452| 46229687|  420|
| 56521692| 46720938|  420|
| 66388188| 46658637|  376|
| 77441114| 46525900|  415|
| 94622795|436351803|  417|
| 97674011| 45400918|  419|
|106886969| 47293176|  343|
|109425244| 45670860|  269|
| 35874337| 45620723|  333|
| 38974238| 45734746|  417|
| 44124493|441324294|  420|
+---------+---------+-----+
only showing top 20 rows


<div class="alert alert-block alert-danger">
<b>Alert:</b> Some assets have the same ticket repeated many times (up to 420 rows in the sample shown). So let's keep a single record per `TICKET_ID`.
</div>
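
One plausible explanation, which the notebook does not confirm, is that the repeats come from the inner join in section 3: after deduplication the DSL data holds at most one row per asset per day, so joining on the asset ID alone copies each ticket once per available day. The largest count shown, 420, does not exceed the 486 days in the period, which is consistent with this reading. The plain-Python sketch below reproduces that fan-out on made-up data and shows that a membership test (the equivalent of a `left_semi` join) keeps one row per ticket.

```python
# Hypothetical data: one ticket, and three daily DSL rows for the same asset
tickets = [{"ASSET_ID": 7, "TICKET_ID": 100}]
dsl_days = [
    {"assetid": 7, "datum": "2020-01-01"},
    {"assetid": 7, "datum": "2020-01-02"},
    {"assetid": 7, "datum": "2020-01-03"},
]

# Inner join on the asset ID only: one output row per (ticket, day) pair
inner = [t for t in tickets for d in dsl_days if t["ASSET_ID"] == d["assetid"]]

# Semi join: keep a ticket if its asset appears at least once in the DSL data
dsl_assets = {d["assetid"] for d in dsl_days}
semi = [t for t in tickets if t["ASSET_ID"] in dsl_assets]

print(len(inner), len(semi))  # 3 1
```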

> Get the tickets dataset and save it for future analysis

In [62]:
df_tickets = df_ticket_filtered_dsl.select(['ASSET_ID','TICKET_ID','SOURCE','SOLUTION_LEVEL','SUMMARY','TICKET_TENTATIVE_ROOT_CAUSE','START_DATE_TICKET']).dropDuplicates()

#Remove the thunderstorm cases and save the data for further analysis
df_tickets = df_tickets.filter(F.col('TICKET_TENTATIVE_ROOT_CAUSE') != 'Grmljavina - 7')
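
Note that in SQL semantics a comparison with a null value is neither true nor false, so a `!=` filter like the one above also drops tickets whose root cause is missing. Whether that is intended is not stated in the notebook. The sketch below imitates the behaviour in plain Python. `sql_not_equal` and `keep_unless_thunderstorm` are hypothetical helpers.

```python
THUNDERSTORM = "Grmljavina - 7"


def sql_not_equal(value, other):
    """Imitate SQL '!=': the result is unknown (None) when either side is null."""
    if value is None or other is None:
        return None
    return value != other


def keep_unless_thunderstorm(value):
    """Null-tolerant variant: keep missing root causes, drop only thunderstorms."""
    return value is None or value != THUNDERSTORM


causes = ["Blokada - 39", None, "Grmljavina - 7"]
# A SQL filter keeps a row only when the condition is exactly true
plain = [c for c in causes if sql_not_equal(c, THUNDERSTORM) is True]
tolerant = [c for c in causes if keep_unless_thunderstorm(c)]
print(plain)     # ['Blokada - 39']
print(tolerant)  # ['Blokada - 39', None]
```

In PySpark the tolerant variant could be written as `F.col('TICKET_TENTATIVE_ROOT_CAUSE').isNull() | (F.col('TICKET_TENTATIVE_ROOT_CAUSE') != 'Grmljavina - 7')`.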


In [68]:
path = '/user/tsystems_vkumar/dsl_tickets/df_tkts_no_thunder.parquet'
# Parquet stores the schema itself, so no 'header' option is needed
df_tickets.repartition(1).write.format('parquet').mode('overwrite').save(path)


In [33]:
df_ticket_filtered_dsl_f = df_ticket_filtered_dsl.select(['ASSET_ID','TICKET_ID','SOURCE','SOLUTION_LEVEL','SUMMARY','TICKET_TENTATIVE_ROOT_CAUSE','START_DATE_TICKET']).dropDuplicates()


In [34]:
df_ticket_filtered_dsl_f.show(5)


+--------+---------+------+--------------+--------------------+---------------------------+-------------------+
|ASSET_ID|TICKET_ID|SOURCE|SOLUTION_LEVEL|             SUMMARY|TICKET_TENTATIVE_ROOT_CAUSE|  START_DATE_TICKET|
+--------+---------+------+--------------+--------------------+---------------------------+-------------------+
|28955953|447128240| DONAT|             1|       Nema podataka|                       null|               null|
|39066930|424372441| DONAT|             1|       Nema podataka|                       null|2020-06-05 19:46:15|
|42926342|428776832| DONAT|             1|       Nema podataka|                       null|2020-07-27 09:37:50|
|46807364| 46208003|  WWMS|             2|NE PROLAZI BOOT P...|               Blokada - 39|2020-11-11 20:47:06|
|56654705| 45928223|  WWMS|             4|DSL SINHRONIZACIJ...|       Prekid; odvod; kr...|2020-10-04 12:45:53|
+--------+---------+------+--------------+--------------------+---------------------------+-------------------+

### 5.1 Filter all DSL line tickets irrespective of level <a class="anchor" id="sec14"></a>

To predict issues related to DSL lines, we filter the tickets by issue type and keep all solution levels.

<div class="alert alert-block alert-success">
<b>Up to you:</b> To get the data for any other issue type, we only need to change the <code>summary</code> argument of the function.
</div>


In [12]:
def get_specific_issue_data(df,summary):
    '''Filter the tickets for a given issue type and exclude thunderstorm cases.
    
    Args:
        df (dataframe): Tickets dataset
        summary (str): Text that the SUMMARY column must contain (issue related to CPE)
        
    Returns:
        df (dataframe) :  spark dataframe with filtered data.
    
    '''
    # Keep only tickets whose summary contains the requested issue text
    df = df.filter(F.col('SUMMARY').contains(summary))
    # Filter on the dataframe passed in, not on the global dataframe
    df = df.filter(F.col('SOURCE').isin(['BBSA','DONAT','WWMS']))
    # Note: != also drops rows where the root cause is null
    df = df.filter(F.col('TICKET_TENTATIVE_ROOT_CAUSE') != 'Grmljavina - 7')
    
    # distinct() on the selected columns removes the duplicates
    cols = ['ASSET_ID','TICKET_ID','SOLUTION_LEVEL','START_DATE_TICKET']
    df = df.select(['ASSET_ID','TICKET_ID','SOURCE','SOLUTION_LEVEL','START_DATE_TICKET']).distinct()
    df = df.orderBy(*cols, ascending=False)
    return df

In [35]:
df_ticket_filtered_dsl_f = get_specific_issue_data(df_ticket_filtered_dsl_f,summary='DSL SINHRONIZACIJ')

In [36]:
print(df_ticket_filtered_dsl_f.count())
df_ticket_filtered_dsl_f.select('ASSET_ID').distinct().count()

146083
108408

In [37]:
df_ticket_filtered_dsl_f.show(5)

+--------+---------+------+--------------+-------------------+
|ASSET_ID|TICKET_ID|SOURCE|SOLUTION_LEVEL|  START_DATE_TICKET|
+--------+---------+------+--------------+-------------------+
|99999823| 44773069|  WWMS|             3|2020-02-17 14:34:16|
|99999615| 45716587|  WWMS|             2|2020-08-25 11:13:06|
|99999432| 44954962|  WWMS|             2|2020-03-24 08:51:38|
|99997653| 45505875|  WWMS|             2|2020-07-11 21:22:39|
|99997653| 45310273|  WWMS|             2|2020-06-03 18:35:17|
+--------+---------+------+--------------+-------------------+
only showing top 5 rows

In [38]:
path = '/user/tsystems_vkumar/dsl_tickets/filtered_cpe_dsl_all_lvl.parquet'
df_ticket_filtered_dsl_f.repartition(1).write.format('parquet').mode('overwrite').option('header','true').save(path)

In [4]:
df_ticket_filtered_dsl_f =  spark.read.option("header","true").parquet('hdfs://nameservicedev1////user/tsystems_vkumar/dsl_tickets/filtered_cpe_dsl_all_lvl.parquet')

Some assets raised further tickets later at a higher `solution_level` to get their DSL lines fixed. So it makes sense to keep the highest `solution_level` reached for each `asset_id`.

In [19]:
df_ticket_filtered_dsl_f2 = df_ticket_filtered_dsl_f.groupby('ASSET_ID').agg(
    F.max('SOURCE').alias('SOURCE'),
    F.max('SOLUTION_LEVEL').alias('SOLUTION_LEVEL'),
    F.max('START_DATE_TICKET').alias('START_DATE_TICKET'),
)

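Note that this aggregation takes the maximum of `SOURCE`, `SOLUTION_LEVEL` and `START_DATE_TICKET` independently, so the three values in a result row can come from different tickets of the same asset. The plain-Python sketch below contrasts that with picking the single ticket that has the highest level (ties broken by the latest date). The sample tickets and function names are hypothetical. In PySpark the row-wise variant is usually written with a window partitioned by `ASSET_ID` together with `F.row_number()`.

```python
from collections import defaultdict

# Hypothetical tickets: (ASSET_ID, SOURCE, SOLUTION_LEVEL, START_DATE_TICKET)
# Dates are 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically.
tickets = [
    ('A1', 'WWMS', 4, '2020-03-01 10:00:00'),
    ('A1', 'WWMS', 2, '2020-09-15 08:30:00'),
    ('A2', 'BBSA', 3, '2020-05-20 12:00:00'),
]

def group_by_asset(rows):
    """Collect the tickets of each asset in a list."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return grouped

def independent_max(rows):
    """Column-wise maxima per asset, like groupby(...).agg(F.max(...), ...)."""
    return {
        asset: (max(r[1] for r in group), max(r[2] for r in group), max(r[3] for r in group))
        for asset, group in group_by_asset(rows).items()
    }

def top_ticket(rows):
    """The single ticket with the highest level per asset, latest date on ties."""
    return {
        asset: max(group, key=lambda r: (r[2], r[3]))[1:]
        for asset, group in group_by_asset(rows).items()
    }

by_column = independent_max(tickets)
by_row = top_ticket(tickets)
# For asset A1, by_column pairs level 4 with the date of the later level-2 ticket,
# while by_row keeps the date on which level 4 was actually reached.
```

Which of the two dates is the right `incident_date` depends on how the label is used downstream.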

In [9]:
df_ticket_filtered_dsl_f2.printSchema()

root
 |-- ASSET_ID: string (nullable = true)
 |-- SOURCE: string (nullable = true)
 |-- SOLUTION_LEVEL: integer (nullable = true)
 |-- START_DATE_TICKET: string (nullable = true)

In [6]:
print(df_ticket_filtered_dsl_f2.count())
df_ticket_filtered_dsl_f2.select('ASSET_ID').distinct().count()

108408
108408

In [49]:
print('Level 0 tickets:',df_ticket_filtered_dsl_f2.filter(F.col('SOLUTION_LEVEL')==0).count())
print('\n')
print('Level 1 tickets:',df_ticket_filtered_dsl_f2.filter(F.col('SOLUTION_LEVEL')==1).count())
print('\n')
print('Level 2 tickets:',df_ticket_filtered_dsl_f2.filter(F.col('SOLUTION_LEVEL')==2).count())
print('\n')
print('Level 3 tickets:',df_ticket_filtered_dsl_f2.filter(F.col('SOLUTION_LEVEL')==3).count())
print('\n')
print('Level 4 tickets:',df_ticket_filtered_dsl_f2.filter(F.col('SOLUTION_LEVEL')==4).count())

('Level 0 tickets:', 0)


('Level 1 tickets:', 0)


('Level 2 tickets:', 16005)


('Level 3 tickets:', 74571)


('Level 4 tickets:', 17832)
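
Each of the five `count()` calls above is a separate Spark action. The same breakdown can be obtained in one pass by grouping on the level, which in PySpark would be `groupBy('SOLUTION_LEVEL').count()`. Below is a plain-Python sketch of that idea using `collections.Counter`. The sample levels are made up.

```python
from collections import Counter

# Hypothetical highest solution level per asset
levels = [2, 3, 3, 4, 3, 2]

# One pass over the data gives the count for every level
counts = Counter(levels)

# Report every level from 0 to 4, including the ones with no tickets
for level in range(5):
    print('Level %d tickets: %d' % (level, counts[level]))
```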

In [20]:
#Check null values in a dataframe
df_ticket_filtered_dsl_f2.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_ticket_filtered_dsl_f2.columns]).show()

+--------+------+--------------+-----------------+
|ASSET_ID|SOURCE|SOLUTION_LEVEL|START_DATE_TICKET|
+--------+------+--------------+-----------------+
|       0|     0|             0|             1093|
+--------+------+--------------+-----------------+

In [21]:
df_ticket_filtered_dsl_f2 = df_ticket_filtered_dsl_f2.na.drop()

In [22]:
df_ticket_filtered_dsl_f2 = df_ticket_filtered_dsl_f2.withColumn('START_DATE_TICKET',F.col('START_DATE_TICKET').cast('date'))

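Casting the string column to `date` keeps only the calendar day and discards the time of day. A plain-Python sketch of the same conversion on one of the timestamps shown earlier (the `to_incident_date` helper is hypothetical):

```python
from datetime import datetime

def to_incident_date(timestamp):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string to a date; pass missing values through."""
    if timestamp is None:
        return None
    return datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S').date()

incident_date = to_incident_date('2020-02-17 14:34:16')
print(incident_date)  # 2020-02-17
```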

In [23]:
df_ticket_filtered_dsl_f2 = df_ticket_filtered_dsl_f2.withColumn('label',F.lit('DSL synchronization'))
df_ticket_filtered_dsl_f2 = df_ticket_filtered_dsl_f2.select(F.col('ASSET_ID').alias('assetid')\
                                                             ,F.col('START_DATE_TICKET').alias('incident_date')\
                                                             ,F.col('label').alias('label')\
                                                             ,F.col('SOLUTION_LEVEL').alias('level'))

In [24]:
df_ticket_filtered_dsl_f2.show(5)

+---------+-------------+-------------------+-----+
|  assetid|incident_date|              label|level|
+---------+-------------+-------------------+-----+
|100237078|   2020-03-25|DSL synchronization|    2|
|100492624|   2020-11-26|DSL synchronization|    3|
|100953000|   2020-09-28|DSL synchronization|    3|
|101647812|   2021-02-21|DSL synchronization|    4|
|102174423|   2021-02-15|DSL synchronization|    3|
+---------+-------------+-------------------+-----+
only showing top 5 rows

In [25]:
path = '/user/tsystems_vkumar/dsl_tickets/dsl_tkts_all_lvls_uniq_n.parquet'
df_ticket_filtered_dsl_f2.repartition(1).write.format('parquet').mode('overwrite').option('header','true').save(path)

In [13]:
df_ticket_filtered_dsl_f2 = spark.read.option("header","true").parquet('hdfs://nameservicedev1///user/tsystems_vkumar/dsl_tickets/dsl_tkts_all_lvls_uniq_n.parquet')

In [14]:
df_ticket_filtered_dsl_f3 = df_ticket_filtered_dsl_f2.select('assetid','incident_date','label')

In [16]:
path = '/user/tsystems_vkumar/dsl_tickets/dsl_tkts_all_lvls_uniq_wlbl.parquet'
df_ticket_filtered_dsl_f3.repartition(1).write.format('parquet').mode('overwrite').option('header','true').save(path)

## 6. Filter Non-DSL tickets for healthy dataset <a class="anchor" id="sec15"></a>

Load the created tickets dataset with all issues except Thunderstorm cases.

In [4]:
df_tickets = spark.read.option("header","true").parquet('hdfs://nameservicedev1//user/tsystems_vkumar/dsl_tickets/df_tkts_no_thunder.parquet')

<div class="alert alert-block alert-success">
<b>Conditions:</b> We include non-DSL tickets in our healthy dataset to add noise. After shortlisting, the non-DSL issue types below were selected.
</div>



In [19]:
df_tickets_dth = df_tickets.filter(F.col("SUMMARY").contains("DTH"))
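# Escaped form of the summary as it is stored in the data (see the distinct SUMMARY values printed below)
df_tickets_radi = df_tickets.filter(F.col("SUMMARY")==u'BE\x8eI\xc8NA VEZA NE RADI')
df_tickets_a = df_tickets.filter(F.col("SUMMARY")=='STALNO ZAUZET')
df_tickets_b = df_tickets.filter(F.col("SUMMARY")=='NE RADE DOLAZNI NI ODLAZNI POZIVI (VOIP)')  # defined here but not part of the union in the next cell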
df_tickets_c = df_tickets.filter(F.col("SUMMARY")=='DODATNE USLUGE')
df_tickets_d = df_tickets.filter(F.col("SUMMARY")=='2. STB SE NE BOOTA')
df_tickets_e = df_tickets.filter(F.col("SUMMARY")=='NEISPRAVAN STB')
df_tickets_f = df_tickets.filter(F.col("SUMMARY")=='STB SE POVREMENO GASI')
df_tickets_g = df_tickets.filter(F.col("SUMMARY")=='NEISPRAVAN DALJINSKI UPRAVLJAÈ')
df_tickets_h = df_tickets.filter(F.col("SUMMARY")=='SMETNJA NA VIDEOTECI')

In [23]:
df_tkts_comb = df_tickets_dth.union(df_tickets_radi).union(df_tickets_a).union(df_tickets_c)\
                             .union(df_tickets_d).union(df_tickets_e).union(df_tickets_f).union(df_tickets_g)\
                                .union(df_tickets_h)

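Since every branch of the union filters the same `df_tickets`, the selection can also be read as one rule: a summary qualifies if it contains `DTH` or equals one of the listed issue names. The sketch below expresses that rule in plain Python. `EXACT_SUMMARIES` lists only the ASCII-only names used in the union, and `is_non_dsl_summary` is a hypothetical helper. In PySpark this would correspond to combining `F.col('SUMMARY').contains('DTH')` and `F.col('SUMMARY').isin(...)` with `|`.

```python
# Exact-match summaries used in the union (ASCII-only names;
# the two names with special characters are left out of this sketch).
EXACT_SUMMARIES = frozenset([
    'STALNO ZAUZET',
    'DODATNE USLUGE',
    '2. STB SE NE BOOTA',
    'NEISPRAVAN STB',
    'STB SE POVREMENO GASI',
    'SMETNJA NA VIDEOTECI',
])

def is_non_dsl_summary(summary):
    """Return True if a ticket summary belongs to the shortlisted non-DSL issues."""
    if summary is None:
        return False
    return 'DTH' in summary or summary in EXACT_SUMMARIES

# Hypothetical sample summaries
samples = ['DTH - KANAL KODIRAN', 'NEISPRAVAN STB', 'DSL SINHRONIZACIJA - PREKID', None]
selected = [s for s in samples if is_non_dsl_summary(s)]
```

A single predicate also avoids building one dataframe per issue type.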

In [24]:
df_tkts_comb.select(F.col('SUMMARY')).distinct().toPandas().SUMMARY.unique()

array([u'2. STB SE NE BOOTA', u'DTH - NEMA ZEMALJSKIH KANALA',
       u'DTH - NEUSPJE\x8aNO UPARIVANJE', u'NEISPRAVAN STB',
       u'DTH - novi transponder', u'DTH - NEISPRAVNA DEKODERSKA KARTICA',
       u'DTH - KANAL KODIRAN', u'STALNO ZAUZET',
       u'BE\x8eI\xc8NA VEZA NE RADI',
       u'DTH - POVREMENO SE GUBI SATELITSKI SIGNAL',
       u'SMETNJA NA VIDEOTECI', u'NEISPRAVAN DALJINSKI UPRAVLJA\xc8',
       u'DTH - NEMA SATELITSKOG SIGNALA', u'DODATNE USLUGE',
       u'STB SE POVREMENO GASI', u'DTH - O\x8aTE\xc6EN ANTENSKI SUSTAV'],
      dtype=object)

In [6]:
df_tkts_comb = df_tkts_comb.dropDuplicates()
print(df_tkts_comb.count())
df_tkts_comb.select(F.col('ASSET_ID')).distinct().count()

53754
46783

In [26]:
df_tkts_comb = spark.read.option("header","true").parquet('hdfs://nameservicedev1///user/tsystems_vkumar/dsl_tickets/df_tkts_non_dsl.parquet')

We take the latest ticket recorded per asset for each of the filtered issue types.

In [27]:
df_tkts_comb_agg_1 = df_tkts_comb.groupby('ASSET_ID','SUMMARY').agg(F.max('START_DATE_TICKET').alias('start_date'))

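`START_DATE_TICKET` appears to be stored as a string (see the schema printed earlier), so `F.max` compares it lexicographically. That still returns the latest ticket because the `yyyy-MM-dd HH:mm:ss` layout sorts in chronological order. A small plain-Python sketch of the same grouping, using made-up rows and a hypothetical `latest_per_group` helper:

```python
def latest_per_group(rows):
    """Keep the latest START_DATE_TICKET for each (ASSET_ID, SUMMARY) pair."""
    latest = {}
    for asset_id, summary, start_date in rows:
        if start_date is None:
            continue  # null dates never win the comparison
        key = (asset_id, summary)
        if key not in latest or start_date > latest[key]:
            latest[key] = start_date
    return latest

# Hypothetical rows: (ASSET_ID, SUMMARY, START_DATE_TICKET)
rows = [
    ('A1', 'STALNO ZAUZET', '2020-04-04 20:13:56'),
    ('A1', 'STALNO ZAUZET', '2021-03-26 16:03:55'),
    ('A1', 'NEISPRAVAN STB', '2020-11-27 08:09:41'),
    ('A2', 'DODATNE USLUGE', None),
]
latest = latest_per_group(rows)
# Groups that only have null dates are absent here, whereas Spark
# returns them with a null date (hence the null check that follows).
```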

In [14]:
print(df_tkts_comb_agg_1.count())
df_tkts_comb_agg_1.select(F.col('ASSET_ID')).distinct().count()

49896
46783

In [28]:
#Check null count
df_tkts_comb_agg_1.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_tkts_comb_agg_1.columns]).show()

+--------+-------+----------+
|ASSET_ID|SUMMARY|start_date|
+--------+-------+----------+
|       0|      0|       591|
+--------+-------+----------+

In [29]:
df_tkts_comb_agg_1 = df_tkts_comb_agg_1.na.drop()

In [30]:
df_tkts_comb_agg = df_tkts_comb_agg_1.select(F.col('ASSET_ID').alias('assetid'),F.col('start_date').alias('incident_date'),F.col('SUMMARY').alias('label'))
df_tkts_comb_agg.show(5)

+---------+-------------------+--------------------+
|  assetid|      incident_date|               label|
+---------+-------------------+--------------------+
|100424784|2021-03-26 16:03:55|       STALNO ZAUZET|
|101866361|2020-04-04 20:13:56|       STALNO ZAUZET|
|102594887|2020-09-22 14:33:38|       STALNO ZAUZET|
|103232787|2020-11-27 08:09:41|NEISPRAVAN DALJIN...|
|103318378|2021-01-18 17:31:26|BEIÈNA VEZA NE RADI|
+---------+-------------------+--------------------+
only showing top 5 rows

In [31]:
# withColumn replaces the existing 'label' column, so no separate drop is needed
df_tkts_comb_agg_1 = df_tkts_comb_agg.withColumn('label',F.lit('Non-DSL issue'))
df_tkts_comb_agg_1 = df_tkts_comb_agg_1.dropDuplicates()
df_tkts_comb_agg_1.show(5)

+--------+-------------------+-------------+
| assetid|      incident_date|        label|
+--------+-------------------+-------------+
|70906512|2020-08-03 12:21:34|Non-DSL issue|
|52318977|2020-01-07 19:09:22|Non-DSL issue|
|70665323|2020-05-15 18:51:19|Non-DSL issue|
|49008735|2020-01-18 14:25:24|Non-DSL issue|
|26966526|2020-01-16 10:20:19|Non-DSL issue|
+--------+-------------------+-------------+
only showing top 5 rows

In [34]:
df_tkts_comb_agg_1 = df_tkts_comb_agg_1.withColumn('incident_date',F.col('incident_date').cast('date'))

In [33]:
print(df_tkts_comb_agg_1.count())
df_tkts_comb_agg_1.select(F.col('assetid')).distinct().count()

49304
46257

In [35]:
path = '/user/tsystems_vkumar/non_dsl_tickets/non_dsl_tkts_n.parquet'
df_tkts_comb_agg_1.repartition(1).write.format('parquet').mode('overwrite').option('header','true').save(path)

### Not required, but useful for translation

In [21]:
mapping = {'NEISPRAVAN MODEM - TEHNIÈAR':'DEFECTIVE MODEM - TECHNICIAN',
'OPTIKA - NEISPRAVAN ONT':'OPTICS - DEFECTIVE ONT',
'DSL SINHRONIZIRAN: POVREMENI PREKIDI KONEKCIJE':'DSL SYNCHRONIZED, OCCASIONAL INTERRUPTIONS',
'NEMA TEL. SIGNALA':'NO TEL. SIGNAL',
'NEISPRAVAN MODEM':'DEFECTIVE MODEM',
'STALNO ZAUZET':'CONSTANTLY OCCUPIED',
'NE PROLAZI BOOT PROCEDURA':'DOES NOT PASS THE BOOT PROCEDURE',
'MAXNET MINI':'MAXNET MINI',
'KORISNIÈKA OPREMA':'USER EQUIPMENT',
'DSL SINHRONIZACIJA - PREKID':'DSL SYNCHRONIZATION - INTERRUPTION',
'POTEKOÆA S BRZINOM':'DIFFICULTY WITH SPEED',
'NE RADE DOLAZNI NI ODLAZNI POZIVI (VOIP)':'INCOMING AND OUTGOING CALLS (VOIP) DO NOT WORK',
'DSL SINHRONIZACIJA - POVREMENI PREKIDI':'DSL SYNCHRONIZATION - OCCASIONAL INTERRUPTIONS',
'DTH - POVREMENO SE GUBI SATELITSKI SIGNAL':'DTH - OCCASIONALLY LOSES SATELLITE SIGNAL',
'NEMOGUÆE UTVRDITI':'IMPOSSIBLE TO DETERMINE',
'BEIÈNA VEZA NE RADI':'WIRELESS DOES NOT WORK',
'MODEM NEISPRAVAN (GRMLJAVINA)':'MODEM FAULT (THUNDERSTORM)',
'NEISPRAVAN MODEM (GRMLJAVINA)':'DEFECTIVE MODEM (THUNDERSTORM)',
'NE RADE DOLAZNI POZIVI':'INCOMING CALLS DO NOT WORK',
'MODEM NEISPRAVAN (GRMLJAVINA) - TEHNIÈAR':'MODEM FAULT (THUNDERSTORM) - TECHNICIAN',
'DODATNE USLUGE':'ADDITIONAL SERVICES',
'DTH - NEMA SATELITSKOG SIGNALA':'DTH - NO SATELLITE SIGNAL',
'NE RADE DOLAZNI NI ODLAZNI POZIVI':'INCOMING OR OUTGOING CALLS DO NOT WORK',
'NEISPRAVAN IP TELEFON':'DEFECTIVE IP PHONE',
'DTH - NEUSPJENO UPARIVANJE':'DTH - PAIR FAILURE',
'NEISPRAVAN MODEM (VoIP) - TEHNIÈAR':'DEFECTIVE MODEM (VoIP) - TECHNICIAN',
'2. STB SE NE BOOTA':'2. STB SE NE BOOTA',
'NEISPRAVAN STB':'DEFECTIVE STB',
'NEISPRAVAN MODEM (GRMLJAVINA) - TEHNIÈAR':'DEFECTIVE MODEM (THUNDERSTORM) - TECHNICIAN',
'STB SE POVREMENO GASI':'STB IS OCCASIONALLY EXTINGUISHED',
'USER SPOJEN: NE OTVARA STRANICE':'USER CONNECTED, DOES NOT OPEN PAGES',
'ZAMRZAVANJE SLIKE':'PICTURE FREEZING',
'NEISPRAVAN DALJINSKI UPRAVLJAÈ':'DEFECTIVE REMOTE CONTROL',
'NE RADI SWITCH':'SWITCH DOES NOT WORK',
'OPTIKA - POVREMENI PREKIDI /NEMA OPTIÈKOG LINKA':'OPTICS - OCCASIONAL INTERRUPTIONS / NO OPTICAL LINK',
'SMETNJA NA SNIMALICI':'INTERFERENCE ON THE RECORDER',
'DSL SINHRONIZIRAN: USER SE NE SPAJA':'DSL SYNCHRONIZED, USER DOES NOT CONNECT',
'NE VIDE SE POJEDINI KANALI':'INDIVIDUAL CHANNELS CANNOT BE SEEN',
'OPTIKA - NEMA AKTIVACIJE MODEMA':'OPTICS - NO MODEM ACTIVATION',
'NEISPRAVAN STB (GRMLJAVINA)':'DEFECTIVE STB (THUNDERSTORM)',
'NEMA ZVUKA':'NO SOUND',
'DTH - NEISPRAVNA DEKODERSKA KARTICA':'DTH - DEFECTIVE DECODER CARD',
'GREKA NA KUÆNOJ INSTALACIJI':'HOUSEHOLD INSTALLATION ERROR',
'NEISPRAVAN MODEM (VoIP)':'DEFECTIVE MODEM (VoIP)',
'NE RADE SAMO DOLAZNI POZIVI':'ONLY INCOMING CALLS DO NOT WORK',
'NEISPRAVAN MEDIATRIX':'DEFECTIVE MEDIATRIX',
'PROBLEMI S DALJINSKIM UPRAVLJAÈEM':'REMOTE CONTROL PROBLEMS',
'OPTIKA - OTEÆENA KUÆNA OPTIÈKA INSTALACIJA':'OPTICS - DAMAGED HOME OPTICAL INSTALLATION',
'OPTIKA - NEISPRAVAN ONT (GRMLJAVINA)':'OPTICS - DEFECTIVE ONT (THUNDERSTORM)',
'IÈNA VEZA MODEM  - PC NE RADI':'WIRED CONNECTION MODEM - PC DOES NOT WORK',
'TRZANJE SLIKE':'PICTURE STUTTERING',
'SLABA ÈUJNOST':'POOR AUDIBILITY',
'NE RADE SAMO ODLAZNI POZIVI':'ONLY OUTGOING CALLS DO NOT WORK',
'DTH - novi transponder':'DTH - new transponder',
'Prekid HA usluge':'Termination of HA service',
'Prekid usluga i prekid 4G backup-a':'Interruption of services and termination of 4G backup',
'PREKID VEZE KOD JAVLJANJA':'CALL DROPS WHEN ANSWERING',
'DTH - OTEÆEN ANTENSKI SUSTAV':'DTH - DAMAGED ANTENNA SYSTEM',
'SMETNJE U TOKU VEZE (UM)':'INTERFERENCE DURING CONNECTION (NOISE)',
'NE RADI - JTG':'NOT WORKING - JTG',
'STALNO ZVONI':'CONSTANTLY RINGS',
'SMETNJA NA USLUZI SIGURAN DOM':'INTERFERENCE WITH SAFE HOME SERVICE',
'OPTIKA - IMA LINKA - NEMA AKTIVACIJE ONT-a':'OPTICS - LINK PRESENT - NO ONT ACTIVATION',
'SMETNJE U TOKU VEZE (JEKA:KANJENJE GLASA)':'INTERFERENCE DURING THE COMMUNICATION (Echo, VOICE DELAY)',
'TeraStream ? prekid':'TeraStream? break',
'DTH - NEMA ZEMALJSKIH KANALA':'DTH - NO EARTH CHANNELS',
'SMETNJA NA VIDEOTECI':'INTERFERENCE AT THE VIDEO LIBRARY',
'DTH - KANAL KODIRAN':'DTH - CHANNEL CHANNEL',
'PRESLUAVANJE U TOKU RAZGOVORA':'LISTENING DURING THE INTERVIEW',
'ZAMJENA BROJEVA':'NUMBER REPLACEMENT',
'NE RADE ODLAZNI POZIVI':'DETAILED CALLS DO NOT WORK',
'Multi Office - POTEKOÆA KOD PRISTUPA LOKACIJAMA':'Multi Office - DIFFICULTIES IN ACCESSING LOCATIONS',
'JEDNOSTRANA ÈUJNOST':'UNILATERAL HEARING',
'NE RADI INTERNET (DIAL - UP)':'THE INTERNET DOES NOT WORK (DIAL - UP)',
'MaxTv2Go':'MaxTv2Go',
'HotSpotFon':'HotSpotFon',
'NE RADE DODATNE OPCIJE (TXT: WIDGET: EPG)':'ADDITIONAL OPTIONS (TXT, WIDGET, EPG) DO NOT WORK',
'POS APARAT NE RADI':'POS APPLIANCE DOES NOT WORK',
'NEISPRAVAN MODEM - TCENTAR':'DEFECTIVE MODEM - TCENTAR',
'Nema podataka':'No data',
'MODEM NEISPRAVAN (GRMLJAVINA) - DISTRIBUTER':'MODEM DEFECTIVE (THUNDERSTORM) - DISTRIBUTOR',
'NEMA REGISTRACIJE BROJA':'NO NUMBER REGISTRATION',
'NEISPRAVAN MODEM - DISTRIBUTER':'DEFECTIVE MODEM - DISTRIBUTOR',
'MODEM NEISPRAVAN (GRMLJAVINA) - TCENTAR':'MODEM DEFECTIVE (THUNDERSTORM) - TCENTAR',
'SPLITER NEISPRAVAN':'SPLITTER DEFECTIVE',
'NEISPRAVAN MODEM (GRMLJAVINA) - DISTRIBUTER':'DEFECTIVE MODEM (THUNDERSTORM) - DISTRIBUTOR',
'NEISPRAVAN MODEM (VoIP) - TCENTAR':'DEFECTIVE MODEM (VoIP) - TCENTAR',
'NEISPRAVAN MODEM (GRMLJAVINA) - TCENTAR':'DEFECTIVE MODEM (THUNDERSTORM) - TCENTAR',
'NE RADI FAX':'FAX DOES NOT WORK',
'Terastream ? neispravan modem':'Terastream? faulty modem',
'NE RADI POS APARAT':'POS DEVICE DOES NOT WORK',
'Terastream ?  degradacija':'Terastream? degradation',
'TeraStream - povremeni prekidi':'TeraStream - intermittent interruptions',
'NEISPRAVAN MODEM (VoIP) - DISTRIBUTER':'DEFECTIVE MODEM (VoIP) - DISTRIBUTOR'}

In [22]:
df_tkts_comb_agg_3 = df_tkts_comb_agg_2.replace(to_replace=mapping,subset=['LABEL'])

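`DataFrame.replace` with a dictionary swaps only values that match a key exactly. Anything without a matching key is left unchanged, which is why every key must match the stored summary character for character. The plain-Python equivalent is a dictionary lookup with the original value as the fallback. The two-entry `sample_mapping` and the `translate` helper below are illustrative only.

```python
# Two entries taken from the mapping, enough to show the lookup behaviour
sample_mapping = {
    'STALNO ZAUZET': 'CONSTANTLY OCCUPIED',
    'NEISPRAVAN STB': 'DEFECTIVE STB',
}

def translate(label, mapping):
    """Return the translated label, or the label itself when no key matches."""
    return mapping.get(label, label)

# The second label differs in case and the third has no entry, so both stay as they are
labels = ['STALNO ZAUZET', 'neispravan stb', 'ZAMJENA BROJEVA']
translated = [translate(label, sample_mapping) for label in labels]
```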

In [23]:
df_tkts_comb_agg_3.select(F.col('LABEL')).distinct().show()

+--------------------+
|               LABEL|
+--------------------+
|  2. STB SE NE BOOTA|
|DTH - OCCASIONALL...|
|DEFECTIVE REMOTE ...|
| ADDITIONAL SERVICES|
|DTH - NO EARTH CH...|
| CONSTANTLY OCCUPIED|
|WIRELESS DOES NOT...|
|STB IS OCCASIONAL...|
|DTH - DAMAGED ANT...|
|  DTH - PAIR FAILURE|
|DTH - CHANNEL CHA...|
|DTH - NO SATELLIT...|
|INTERFERENCE AT T...|
|DTH - new transpo...|
|DTH - DEFECTIVE D...|
|       DEFECTIVE STB|
+--------------------+

In [24]:
df_tkts_comb_agg_3.show(5)

+---------+-------------------+--------------------+
| ASSET_ID|         START_DATE|               LABEL|
+---------+-------------------+--------------------+
|100424784|2021-03-26 16:03:55| CONSTANTLY OCCUPIED|
|101866361|2020-04-04 20:13:56| CONSTANTLY OCCUPIED|
|102594887|2020-09-22 14:33:38| CONSTANTLY OCCUPIED|
|103232787|2020-11-27 08:09:41|DEFECTIVE REMOTE ...|
|103318378|2021-01-18 17:31:26|WIRELESS DOES NOT...|
+---------+-------------------+--------------------+
only showing top 5 rows

In [25]:
path = '/user/tsystems_vkumar/dsl_tickets/df_tkts_non_dsl_sel_final.parquet'
df_tkts_comb_agg_3.repartition(1).write.format('parquet').mode('overwrite').option('header','true').save(path)
