[Essay Contest] Application Practice of DSS in Programmatic Advertising #13

Open
urzeric opened this issue Nov 13, 2020 · 0 comments
urzeric commented Nov 13, 2020

1. Application Scenarios

珑玺科技's big-data management platform (DMP) is built on the Hadoop ecosystem, using tool chains such as Hive and Spark. Previously, all of the related jobs were implemented as shell scripts run by scheduled (cron) tasks. As the business requirements grew more complex, the accumulated scripts became increasingly hard to maintain, raising the cost of later iterations and of communication between project members. At that point we came across this project, newly open-sourced by WeBank, which lets us run the following big-data workflows on a single platform:

  • Data exchange
  • Data development
  • Data quality
  • Data visualization
  • Data delivery

Together these cover collection, deduplication, cleaning, normalization, and the rest of the data pipeline, so almost all data-governance work can flow smoothly through DSS, improving the overall experience of big-data management and analysis.

Our DMP platform is integrated mainly on Hadoop 2.6, with its core functions focused on metadata handling and on deduplicating, cleaning, and standardizing data. It implements the platform's OLAP workflows, and the business requirements it ultimately serves fall into four areas:

  • Metadata management
  • Accumulation of user tags
  • Anti-fraud data analysis
  • Output of the related BI reports

2. Problems Solved

Multiple data source support

The platform's main supported data sources are MySQL and Hive, and it even supports TiDB, a recent distributed NewSQL database, as well as other third-party data formats. Before this platform we had to perform frequent data conversions, with a high context-switching cost; after introducing it, the integrated data exchange module lets us bring in all of these data sources for cross analysis very smoothly, which improved the efficiency of managing multiple data sources.

Data script development

Picture1

Before DSS was deployed, every task on the platform was a hand-written shell script driving the big-data analysis pipeline; as the business iterated and requirements multiplied, those scripts became less and less maintainable. The Scriptis data development module is fully compatible with HQL, MySQL, and PySpark, so almost all of the old scripts could be reused, and both the execution process and the results are visualized, which raised the efficiency of data analysis (a sketch of the kind of legacy job we migrated follows the screenshot below).

Picture2
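For illustration only, this is roughly what one of the legacy cron-driven jobs looked like; the table names, fields, and paths are made up for this sketch and are not taken from our production scripts. The embedded HQL is the part that can be pasted into a Scriptis hql script almost unchanged:

#!/bin/bash
# daily_dedup.sh -- hypothetical example of a legacy scheduled job
# crontab entry: 0 2 * * * /home/hadoop/jobs/daily_dedup.sh
dt=$(date -d "yesterday" +%Y%m%d)
# Deduplicate the previous day's raw impression log into a cleaned table;
# the INSERT ... SELECT body is what gets reused directly in Scriptis.
hive -e "
INSERT OVERWRITE TABLE dmp.dwd_impression_dedup PARTITION (dt='${dt}')
SELECT device_id, ad_id, max(event_time) AS event_time
FROM dmp.ods_impression_raw
WHERE dt='${dt}'
GROUP BY device_id, ad_id;
"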

3. Best Practices

Problem reading data from Alibaba Cloud OSS

Most of our data is stored in Alibaba Cloud Object Storage Service (OSS), so the jar files needed to read OSS have to be configured separately: sync the Alibaba Cloud OSS jars into the Linkis lib directories.

Appendix: notes on integrating Alibaba Cloud OSS with CDH

# Copy the OSS-related jars into every lib directory under /opt/linkis
cd /opt/linkis
find . -name "lib" -print | awk '{printf("cp /home/hadoop/aliyun-oss-jars/*.jar /opt/linkis/%s/\n", $1)}' | sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpclient-4.5.2.jar /opt/linkis/%s/\n", $1)}' | sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpcore-4.4.4.jar /opt/linkis/%s/\n", $1)}' | sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/jdom-1.1.jar /opt/linkis/%s/\n", $1)}' | sh
$ cd aliyun-oss-jars/
$ ls -l
total 2932
-rw-r--r-- 1 hadoop hadoop 116337 Jan 2 10:59 aliyun-java-sdk-core-3.4.0.jar
-rw-r--r-- 1 hadoop hadoop 788137 Jan 2 10:59 aliyun-java-sdk-ecs-4.2.0.jar
-rw-r--r-- 1 hadoop hadoop 215492 Jan 2 10:59 aliyun-java-sdk-ram-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 13277 Jan 2 10:59 aliyun-java-sdk-sts-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 562719 Jan 2 10:59 aliyun-sdk-oss-3.4.1.jar
-rw-r--r-- 1 hadoop hadoop 71074 Jan 2 15:12 hadoop-aliyun-2.6.0-cdh5.14.4.jar
-rw-r--r-- 1 hadoop hadoop 736658 Jan 2 15:10 httpclient-4.5.2.jar
-rw-r--r-- 1 hadoop hadoop 326724 Jan 2 15:10 httpcore-4.4.4.jar
-rw-r--r-- 1 hadoop hadoop 153115 Jan 2 15:10 jdom-1.1.jar

Sync the Alibaba Cloud libraries into DSS:

cd /opt/aliyun-oss-jars/
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-resourcemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/eureka/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-jdbc-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/modulebak/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/module/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-publicservice/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-bml/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-gateway/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-metadata/lib/%s\n", $2, $2)}' | sh
# Also link the hadoop-aliyun connector into every lib directory under a stable name
cd /opt/linkis
find . -name "lib" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/hadoop-aliyun-2.6.0-cdh5.14.4.jar /opt/linkis/%s/lib/hadoop-aliyun.jar\n", $2)}' | sh
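A quick check that the jars actually landed in every lib directory (nothing DSS-specific; adjust the name patterns if your jar versions differ):

cd /opt/linkis
# Each microservice's lib directory should now contain the OSS SDK and the linked connector jar
find . \( -name "aliyun-sdk-oss-*.jar" -o -name "hadoop-aliyun.jar" \) | sort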

The panel on the right side of Scriptis shows no data and just keeps refreshing

Step 1: Modify the file

By default, the two queries in HiveMetaDao.xml filter databases and tables through the Hive metastore privilege tables (DB_PRIVS / TBL_PRIVS). If Hive authorization is not maintained in those tables, as in our setup, the queries return nothing and the panel keeps refreshing. The workaround is to comment out the privilege-based queries and simply list all databases and tables:

vim /home/hadoop//Linkis20191218/metadata/target/classes/com/webank/wedatasphere/linkis/metadata/hive/dao/impl/HiveMetaDao.xml
<select id="getDbsByUser" resultType="java.lang.String" parameterType="java.lang.String">
        <!--select NAME from(
        select t2.NAME  as NAME
        from DB_PRIVS t1, DBS t2
        where (lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
        OR t1.PRINCIPAL_NAME IN (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
        and lcase(t1.DB_PRIV) in ('select','all') and t1.DB_ID =t2.DB_ID
        union all
        select t3.NAME as NAME
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID and lcase(t1.TBL_PRIV) in ('select','all') and (
        lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR} or lcase(t1.PRINCIPAL_NAME) in (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
        and t2.DB_ID=t3.DB_ID) a
        GROUP BY NAME
        order by NAME-->
        select name from DBS
    </select>

    <select id="getTablesByDbNameAndUser" resultType="map"  parameterType="map">
        <!--select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from DB_PRIVS t1,TBLS t2, DBS t3
        where  t1.DB_ID =t3.DB_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.DB_PRIV) in ('select','all')
        and lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from DB_PRIVS t1,TBLS t2, DBS t3
        where  t1.DB_ID =t3.DB_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.DB_PRIV) in ('select','all')
        and lcase(t1.PRINCIPAL_NAME) in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.TBL_PRIV) in ('select','all')
        and t1.PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.TBL_PRIV) in ('select','all')
        and t1.PRINCIPAL_NAME in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        order by NAME;-->
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBLS t2 , DBS t3
        where 
         t2.DB_ID=t3.DB_ID
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
    </select>

Step 2: Go into Linkis-20191218/metadata and recompile

Note that mvn clean deletes the target/ directory, so if you edited the copy under target/classes, apply the same change to the corresponding source file of the metadata module before rebuilding, otherwise the modification is lost.

root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn clean 
root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn install

Step 3: Confirm the corresponding jar has been updated

cp linkis-metadata-0.9.2.jar /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar 
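A simple way to confirm the deployed jar really is the freshly built one (a minimal sketch; it assumes the rebuilt jar sits in the module's target/ directory, where Maven writes it, and keeps the same file name):

# Compare the checksum of the newly built jar with the copy deployed to Linkis
md5sum /home/hadoop/Linkis-20191218/metadata/target/linkis-metadata-0.9.2.jar \
       /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar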

Step 4: Restart all linkis-metadata microservices
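The exact service-control script names differ between Linkis versions, so rather than assuming one, the sketch below just locates whatever scripts ship with the deployed service and checks that the process is back after the restart:

# List the shell scripts bundled with the metadata microservice
find /opt/linkis/linkis-metadata -maxdepth 2 -name "*.sh"
# After restarting, confirm the process is running again
ps -ef | grep linkis-metadata | grep -v grep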

Error code: errCode: 10905

Error log (the Chinese log messages read: "calling method: addRootFlow", "projectVersionID is: 1", "an exception occurred during execution"):

[2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (63) [around] - 调用方法:addRootFlow
2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (72) [around] - projectVersionID为:1
2019-12-24 22:47:39.135 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (80) [around] - 执行过程出现异常 
com.webank.wedatasphere.linkis.httpclient.exception.HttpClientResultException: errCode: 10905 ,
		desc: URL http://127.0.0.1:9001/api/rest_j/v1/bml/upload request failed! ResponseBody is 
		{"timestamp":"2019-12-24T14:47:39.132+0000","status":404,"error":"Not Found","message":"/api/rest_j/v1/bml/upload","path":"/api/rest_j/v1/bml/upload"}. ,
		ip: cdh04 ,port: 9004 ,serviceKind: dss-server
		]

Solution:
Go to the linkis deployment directory and remove the offending package from every microservice (the command below renames all jasper-* jars so they are no longer on the classpath):

find . -name "jasper-*" -print | awk  '{printf("mv %s %s_del\n", $1,$1)}' |sh
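To double-check that nothing was missed, and to keep an easy way back if removing the jars causes other problems, a small sketch in the same find/awk style as above:

cd /opt/linkis
# Any jasper jar still ending in .jar (i.e. not yet renamed) shows up here
find . -name "jasper-*.jar" -print
# Rollback sketch: strip the _del suffix to restore the original file names
find . -name "jasper-*_del" -print | awk '{orig=$1; sub(/_del$/, "", orig); printf("mv %s %s\n", $1, orig)}' | sh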

Software versions:

  • Hadoop 2.6.0
  • MySQL 5.7.26
  • Hive 1.1.0
  • HBase 1.2.0
  • Spark2 2.2.0

 
